Abstract

The problem of the order of the fluctuation of the Longest Common Subsequence (LCS) of
two independent sequences has been open for decades. There exist contradicting conjectures
on the topic, [1] and [2]. Lember and Matzinger [3] showed that with i.i.d. binary strings, the
variance of the length of the LCS is asymptotically linear in the length of the strings,
provided that 0 and 1 have very different probabilities. Nonetheless, with two i.i.d. sequences
and a finite number of equiprobable symbols, the typical size of the fluctuation of the LCS
remains unknown. In the present article, we determine the order of the fluctuation of the LCS
for a special model of i.i.d. sequences made out of blocks. A block is a contiguous substring
consisting of only one type of symbol. Our model allows only three possible block lengths,
each chosen with equal probability. For i.i.d. sequences with equiprobable symbols, the blocks
are independent of each other. In order to study the fluctuation of the LCS in this model,
we developed a method which reformulates the fluctuation problem as a (relatively) low
dimensional optimization problem. We finally proved that for our model, the order of the
fluctuation of the length of the LCS agrees with Waterman's conjecture [2]. We believe that
our method can be applied to any other case dealing with i.i.d. sequences, only that the
optimization problem might be more complicated to formulate and to solve.

1 Introduction 
1.1 Motivation 

Throughout this paper, X and Y will denote two finite strings over a finite alphabet
S. A common subsequence of X and Y is a string which is a subsequence of X as well
as of Y. A Longest Common Subsequence of X and Y (denoted simply by LCS of X and
Y, or only LCS when the context is clear enough) is a common subsequence of X and Y of
maximal length.

Let us motivate the study of the LCS of two strings with an example: let x = ACGTAGCA
and y = ACCGTATA be two sequences over the finite alphabet S = {A, C, G, T}. A common
subsequence of x and y could be z = ATA. Indeed, the string z can be obtained from
both x and y by just deleting some letters. We can represent the common subsequence z as
an alignment with gaps (a gap is denoted by '-'). The letters which are not in the
subsequence get aligned with gaps, so that the alignment matches up the common letters of
both sequences. The common subsequence z = ATA can correspond to the following alignment:

The representation of a subsequence as an alignment with gaps is not necessarily unique.
However, each alignment with gaps defines exactly one common subsequence. We are
interested only in alignments which align same-letter pairs or letters with gaps. In this paper,
an alignment which aligns a maximum number of letter pairs of x and y is called an optimal
alignment. The subsequence defined by an optimal alignment is hence an LCS. The LCS of
x and y is LCS(x, y) = ACGTAA and corresponds to the optimal alignment:
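
As an illustrative aside, the standard dynamic-programming computation of an LCS, together with one optimal alignment with gaps, can be sketched as follows. This is a minimal sketch and not the authors' code; the function name `lcs_alignment` is hypothetical, and since optimal alignments are not unique, the traceback returns just one of them.

```python
def lcs_alignment(x, y):
    """Return (lcs, row_x, row_y): an LCS of x and y and one optimal
    alignment with gaps ('-'). Hypothetical helper for illustration."""
    n, m = len(x), len(y)
    # table[i][j] = length of an LCS of x[:i] and y[:j]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Trace back from the corner, emitting one alignment column per step.
    lcs, row_x, row_y = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and x[i - 1] == y[j - 1]:
            # Same-letter pair: aligned column that contributes to the LCS.
            lcs.append(x[i - 1])
            row_x.append(x[i - 1])
            row_y.append(y[j - 1])
            i, j = i - 1, j - 1
        elif j == 0 or (i > 0 and table[i - 1][j] >= table[i][j - 1]):
            # Letter of x aligned with a gap.
            row_x.append(x[i - 1])
            row_y.append("-")
            i -= 1
        else:
            # Letter of y aligned with a gap.
            row_x.append("-")
            row_y.append(y[j - 1])
            j -= 1
    return ("".join(reversed(lcs)),
            "".join(reversed(row_x)),
            "".join(reversed(row_y)))


lcs, row_x, row_y = lcs_alignment("ACGTAGCA", "ACCGTATA")
print(len(lcs))
print(row_x)
print(row_y)
```

For the two strings of the example, the LCS found this way has length 6, and the two printed rows never pair two different letters in the same column.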
In Bioinformatics (for instance [HIS]), one of the main problems is to decide whether two
sequences are related or not. If they are, it probably means that they evolved from a common
ancestor. So, if they are related they should look somehow similar. Biologists try to determine
which parts are related by finding an alignment which aligns the related parts. In our current
example, the sequences x = ACGTAGCA and y = ACCGTATA are somehow similar, but
if we compare them letter by letter the great similarity does not become obvious:

In the alignment without gaps (1.3) we aligned mostly non-corresponding letter pairs, and
obtained only 3 aligned same-letter pairs: (from left to right) the first, the second
and the last pair. This is much less than what our optimal LCS alignment (1.2) achieved. A
possible explanation of why (1.3) is worse than (1.2) at detecting similarities is that some letters
"got lost" in the evolution process, so that they are present only in one of the two sequences.
It is therefore more useful to consider alignments with gaps when looking for similarities. Longest
Common Subsequences and Optimal Alignments are the main tools in computational
biology to recognize when strings are similar. A relatively long LCS indicates that the strings
are related, but how long does the LCS need to be to imply relatedness? Sequences which are
not related are stochastically independent. Could it be that independent stochastic strings
have a long LCS purely by chance? To understand this question, we need to figure out the
size of the fluctuation of the LCS of independent strings. We are interested in the asymptotics
of the fluctuation since we mainly consider long sequences.
1.2 Notation and history

Let $X = X_1 X_2 \ldots X_n$ and $Y = Y_1 Y_2 \ldots Y_n$ be two stationary random sequences which are
independent of each other, both drawn from the same finite alphabet S. Let $L_n := |LCS(X, Y)|$
denote the length of the LCS of X and Y. A simple sub-additivity argument [7] shows that
the expected length of the LCS divided by n converges to a constant:

$$\lim_{n \to \infty} \frac{E[L_n]}{n} =: \gamma > 0.$$

The constant $\gamma$ depends on the distribution of X and Y. But even for such simple cases as
i.i.d. sequences with equiprobable symbols, the exact value of $\gamma$ is not known. Chvatal-Sankoff
[1] derived upper and lower bounds for $\gamma$. These bounds were further refined by Baeza-Yates,
Gavalda, Navarro and Scheihing [8], Deken [9], Dancik-Paterson [10], [11] and finally Durringer,
Hauser, Martinez, Matzinger [12], [13]. The asymptotic value of the rescaling coefficient $\gamma$ as
the number of symbols (the size of S) goes to infinity was determined by Kiwi, Loebl and
Matousek [13]. On the other hand, the speed of convergence was obtained by Alexander [15]
by using techniques from percolation theory.

The order of magnitude of the fluctuation of $L_n$ is unknown even for situations as simple as i.i.d.
sequences of equiprobable letters. In [2], Waterman conjectured that, in many situations, the
fluctuation of the LCS is of the order of the square root of the length times a constant, that is:

$$VAR[L_n] = \Theta(n). \tag{1.4}$$

Here the order $\Theta(n)$ means that there exist constants $0 < a < b$ such that

$$a n \le VAR[L_n] \le b n$$

for all $n \in \mathbb{N}$ (the constants a and b might depend on the distribution of X and Y).



So far, Lember and Matzinger [3] proved the order given in (1.4) for binary i.i.d. sequences,
but only when the probability of 1 is much less than the probability of 0. Durringer, Lember
and Matzinger [16] also obtained the same order when one sequence is non-random, binary
and periodic whilst the other binary sequence is i.i.d. Bonetto and Matzinger [17] proved
the same order when the first sequence is drawn from a three letter alphabet {0, 1, a}
whilst the second sequence is binary. Finally, Houdre and Matzinger [18] proved the
same order when the two sequences are binary and i.i.d. but the scoring function which
defines the alignment is such that one letter has a somewhat larger score than the other letter.
Recall that in [19], Steele proved that there exists a constant $c > 0$ not depending on n
such that $VAR[L_n] \le c \cdot n$, regardless of the alphabet S. This means that one only needs to
find good lower bounds for the variance of $L_n$ in order to obtain results on the fluctuation of $L_n$.



The LCS problem can be formulated as another popular open problem in probability
theory, namely the Last Passage Percolation problem with correlated weights. The equivalence
is as follows: let the set of vertices in our percolation setting be
$V := \{0, 1, 2, \ldots, n\} \times \{0, 1, 2, \ldots, n\}$. The set of oriented edges $E \subset V \times V$ contains
horizontal, vertical and diagonal edges. The horizontal edges are oriented to the right, whilst
the vertical edges are oriented upwards. Both have unit length. The diagonal edges point
up-right at a 45-degree angle and have length $\sqrt{2}$. Hence
$E := \{ (v, v + e_1), (v, v + e_2), (v, v + e_3) \mid v \in V \}$, where
$e_1 := (1, 0)$, $e_2 := (0, 1)$ and $e_3 := (1, 1)$. With the horizontal and vertical edges, we associate
a weight of 0. With the diagonal edge from $(i, j)$ to $(i + 1, j + 1)$ we associate the weight 1
if $X_{i+1} = Y_{j+1}$ and $-\infty$ otherwise. In this manner, we obtain that the length of the LCS,
denoted by $L_n = |LCS(X_1 X_2 \ldots X_n, Y_1 Y_2 \ldots Y_n)|$, is equal to the total weight of the heaviest
path going from (0, 0) to (n, n). Note that the weights on our 2-dimensional graph are not
"truly 2-dimensional": they depend only on the one dimensional sequences $X = X_1 \ldots X_n$
and $Y = Y_1 \ldots Y_n$.
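
The percolation formulation above can be checked on small inputs. The following Python sketch is only an illustration under the stated edge weights (0 for horizontal and vertical edges, 1 or $-\infty$ for diagonal edges); the function name `heaviest_path_weight` is hypothetical and not taken from the paper.

```python
def heaviest_path_weight(x, y):
    """Weight of the heaviest oriented path from (0, 0) to (len(x), len(y))
    in the graph described above. Hypothetical helper for illustration."""
    neg_inf = float("-inf")
    n, m = len(x), len(y)
    # best[i][j] = weight of the heaviest path from (0, 0) to vertex (i, j)
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                # Horizontal edge (i-1, j) -> (i, j), weight 0.
                best[i][j] = max(best[i][j], best[i - 1][j])
            if j > 0:
                # Vertical edge (i, j-1) -> (i, j), weight 0.
                best[i][j] = max(best[i][j], best[i][j - 1])
            if i > 0 and j > 0:
                # Diagonal edge (i-1, j-1) -> (i, j): weight 1 on a letter
                # match, minus infinity otherwise (edge is never used).
                weight = 1 if x[i - 1] == y[j - 1] else neg_inf
                best[i][j] = max(best[i][j], best[i - 1][j - 1] + weight)
    return best[n][m]


print(heaviest_path_weight("ACGTAGCA", "ACCGTATA"))
```

The recursion visits vertices in an order compatible with the edge orientation, so each vertex is finalized before any edge leaving it is used; the returned weight is the LCS length of the two input strings.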

The LCS problem is also related to the problem of the Longest Increasing Subsequence
of a random permutation (LIS for short): the LIS can be seen as the LCS of
two sequences where one is a sequence of randomly permuted numbers and the other is the
sequence of increasing integers. Take for example 5 cards numbered from 1 to 5. Shuffle them
thoroughly (until each permutation is equally likely). Then, lay them down face up in one
line on a table. For example, you could obtain the permutation:

2 3 1 5 4

A longest increasing subsequence here is 2, 3, 5. We designate by $\ell_n$ the length of the longest
increasing subsequence of such a random permutation, so in our case $\ell_5 = 3$. Note that the
length of the LIS is equal to the length of the LCS of the permutation and the sequence of
increasing numbers. In our example $\ell_5 = |LCS(23154, 12345)|$. So, thanks to this relation
and the recent tremendous breakthrough in the study of the LIS problem, many people
were optimistic about finding a solution to the LCS problem by applying the new techniques
from the LIS problem. Unfortunately, nobody has succeeded so far in doing that. Moreover,
we now believe that the LCS problem and the LIS problem essentially belong to different
classes, though they have some features in common, for instance that both can be seen as
passage percolation models, since the LIS problem is asymptotically equivalent to a special
last passage percolation process on a Poisson graph. Let us recall some basic results about
the LIS problem. In [20], Baik, Deift and Johansson proved that

$$\frac{\ell_n - 2\sqrt{n}}{n^{1/6}}$$

converges in distribution as $n \to \infty$ to a so-called Tracy-Widom distribution (here $\ell_n$ denotes
the length of the longest increasing subsequence of a random permutation drawn from the
symmetric group $S_n$ with the uniform distribution). This limiting distribution can be obtained
via the solution of the Painleve II equation. It was first obtained by Tracy and Widom
[21], [22] in the framework of Random Matrix Theory, where it gives the limit distribution for
the (centered and scaled) largest eigenvalue in the Gaussian Unitary Ensemble of Hermitian
matrices. The problem of the asymptotics of $\ell_n$ was first raised by Ulam [23]. Substantial
contributions to the solution of the problem have been made by Aldous and Diaconis [24],
Hammersley [25], Logan and Shepp [26], Vershik and Kerov (Soviet Math. Dokl., 1977).
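
The identity between the LIS of a permutation and its LCS with the increasing sequence, used in the card example above, can be checked numerically. The sketch below is illustrative only; `lis_length` uses the classical patience-sorting idea, which is a choice made here and not a method discussed in the paper, and both function names are hypothetical.

```python
from bisect import bisect_left


def lis_length(perm):
    """Length of a longest strictly increasing subsequence of perm."""
    # tails[k] = smallest possible last value of an increasing
    # subsequence of length k + 1 seen so far.
    tails = []
    for value in perm:
        k = bisect_left(tails, value)
        if k == len(tails):
            tails.append(value)
        else:
            tails[k] = value
    return len(tails)


def lcs_length(a, b):
    """Length of a longest common subsequence of the sequences a and b."""
    prev = [0] * (len(b) + 1)
    for item_a in a:
        cur = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


cards = [2, 3, 1, 5, 4]
print(lis_length(cards), lcs_length(cards, sorted(cards)))
```

For the permutation 2 3 1 5 4 both quantities equal 3, in agreement with the example.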
Coming back to the reason why we believe that the LCS problem and the LIS problem are
essentially different, we can say the following: in the LIS case, the order of the fluctuation is the
power 1/3 of the expectation and not the square root. For the LCS case, the expectation is of order
n. So, if the fluctuation were also the third root of the expectation, then
$VAR[L_n]$ would be of order $n^{2/3}$. This is the order of magnitude conjectured by
Chvatal-Sankoff [1], for which several people have heuristic proofs. We believe that this
order is wrong for the LCS problem, based on all the works cited above (and the present article),
which confirmed Waterman's conjecture in many cases. However, for short sequences (small
n) the order conjectured by Chvatal-Sankoff (which corresponds to the order of the
fluctuation of the LIS) might be what one approximately observes in simulations. We believe that
for short sequences, the underlying percolation structure shared by the LCS problem
and the LIS problem makes the two fluctuations look the same, though the situation changes
for large n: for short sequences, the correlation of the weights in the LCS problem has no
strong effect and the system behaves as if the weights were independent, as in the Poisson
graph situation. So far, these arguments have not been rigorously proved, which in our
opinion makes them attractive open questions in the area.

2 Model and main ideas 

Let $l > 0$ be an integer parameter. Let $B_{X1}, B_{X2}, \ldots$ and $B_{Y1}, B_{Y2}, \ldots$ be two i.i.d. sequences
independent of each other such that:

$$P(B_{Xi} = l - 1) = P(B_{Xi} = l) = P(B_{Xi} = l + 1) = 1/3,$$

$$P(B_{Yi} = l - 1) = P(B_{Yi} = l) = P(B_{Yi} = l + 1) = 1/3.$$

We call the runs of 0's and 1's blocks. Let $X^\infty = X_1 X_2 X_3 \ldots$ be the binary sequence such
that the i-th block has length $B_{Xi}$, where $X_1$ is chosen to be 0 with probability 1/2 or 1 with
probability 1/2. Similarly let $Y^\infty = Y_1 Y_2 Y_3 \ldots$ be the binary sequence such that the i-th block
has length $B_{Yi}$, where $Y_1$ is chosen to be 0 with probability 1/2 or 1 with probability 1/2.

Example 2.1 Assume that $X_1 = 1$ and $B_{X1} = 2$, $B_{X2} = 3$ and $B_{X3} = 1$. Then
the sequence $X^\infty$ starts as follows: $X^\infty = 1100010\ldots$, meaning that in $X^\infty$ the first block
consists of two 1's, the second block consists of three 0's, the third block consists of one 1,
etc.

Let X denote the sequence obtained by only taking the first n bits of $X^\infty$, namely
$X = X_1 X_2 X_3 \ldots X_n$, and similarly $Y = Y_1 Y_2 Y_3 \ldots Y_n$. Let $L_n$ denote the length of the LCS of X
and Y, $L_n := |LCS(X, Y)|$.
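
To make the block model concrete, here is a minimal simulation sketch. It assumes $l \ge 2$ so that every block has positive length; the function names `bits_from_blocks`, `block_sequence` and `run_lengths` are hypothetical and introduced only for this illustration.

```python
import random
from itertools import groupby


def bits_from_blocks(first_symbol, block_lengths):
    """Binary sequence whose i-th block has the i-th given length and
    whose first bit is first_symbol; consecutive blocks alternate."""
    bits, symbol = [], first_symbol
    for length in block_lengths:
        bits.extend([symbol] * length)
        symbol = 1 - symbol
    return bits


def block_sequence(n, l, rng):
    """First n bits of a sequence with i.i.d. block lengths, uniform on
    {l-1, l, l+1}, and a fair first bit. Assumes l >= 2."""
    if l < 2:
        raise ValueError("l must be at least 2 so that blocks are non-empty")
    lengths = []
    while sum(lengths) < n:
        lengths.append(rng.choice([l - 1, l, l + 1]))
    return bits_from_blocks(rng.randrange(2), lengths)[:n]


def run_lengths(bits):
    """Lengths of the maximal runs (blocks) of a binary sequence."""
    return [len(list(group)) for _, group in groupby(bits)]


# Example 2.1: X_1 = 1 with block lengths 2, 3, 1.
print(bits_from_blocks(1, [2, 3, 1]))
x = block_sequence(30, 3, random.Random(0))
print(run_lengths(x))
```

Because only the first n bits are kept, the last block of the truncated sequence may be shorter than $l - 1$; all earlier blocks have length $l - 1$, $l$ or $l + 1$.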

The main result of this paper states that for $l$ large enough, the order of the
fluctuation of $L_n$ is $\sqrt{n}$:

Theorem 2.1 There exists $l_0$ so that for all $l \ge l_0$ we have that:

$$VAR[L_n] = \Theta(n)$$

for n large enough.
We show that the above theorem is equivalent to proving that "a certain random modification
has a biased effect on $L_n$". Similar techniques appear in other papers (for
instance see [3], [17]). So the main difficulty is actually proving that the random modification
typically has a biased effect on the LCS. This random modification is performed as follows:
we choose at random in X one block of length $l - 1$ and one block of length $l + 1$.
All the blocks in X of length $l - 1$ have the same probability of being chosen, and likewise
all the blocks in X of length $l + 1$ have the same probability of being chosen.
Then we change the length of both these blocks to $l$. The resulting new sequence
is denoted by $\tilde{X}$. Let $\tilde{L}_n$ denote the length of the LCS after our modification of X. Hence:

$$\tilde{L}_n := |LCS(\tilde{X}, Y)|.$$
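
The random modification just described can be sketched at the level of block lengths. This is an illustrative sketch only: it works on a list of complete block lengths (ignoring the truncation of the last block at position n), and the function name `random_modification` is hypothetical.

```python
import random


def random_modification(block_lengths, l, rng):
    """Pick uniformly one block of length l-1 and one block of length l+1
    and change both lengths to l. Returns a new list of block lengths."""
    short = [k for k, b in enumerate(block_lengths) if b == l - 1]
    long_ = [k for k, b in enumerate(block_lengths) if b == l + 1]
    if not short or not long_:
        raise ValueError("need at least one block of each extreme length")
    modified = list(block_lengths)
    modified[rng.choice(short)] = l   # block of length l-1 grows by one
    modified[rng.choice(long_)] = l   # block of length l+1 shrinks by one
    return modified


blocks = [2, 2, 2, 4, 4, 3]          # an arbitrary list of block lengths, l = 3
print(random_modification(blocks, 3, random.Random(1)))
```

One block gains a bit and another loses a bit, so the total length of the modified block list is the same as before.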

If we can prove that our block length changing operation typically has a biased effect on the
LCS, then the order of the fluctuation of $L_n$ is $\sqrt{n}$. This is the content of the next theorem:

Theorem 2.2 Assume that there exist $\epsilon > 0$ and $\alpha > 0$ not depending on n such that for
all n large enough we have:

$$P\left( E[\tilde{L}_n - L_n \mid X, Y] \ge \epsilon \right) \ge 1 - \exp(-n^{\alpha}). \tag{2.1}$$

Then,

$$VAR[L_n] = \Theta(n)$$

for n large enough.

The above theorem reduces the problem of the order of fluctuation to proving that our
random modification typically has a higher probability of leading to an increase than to a
decrease in score. The proof of this result is not included in the present article for the sake
of brevity, though all the details are in [27]. In all that follows, we assume that Theorem 2.2
is true.

The next step is to ask: how can we prove, in our block model, that condition (2.1) is
satisfied? In Theorem 2.3 we see that condition (2.1) can be obtained from the positive
solution of a minimization problem. This minimization problem concerns the proportions of
symbols which build up the LCS and is posed on a 9-dimensional space. By using Lagrange
multiplier techniques, we are able to further reduce it to a parametrized 3-dimensional
optimization problem. Furthermore, we numerically and graphically verify that the positive
minimum condition already holds for $l > 5$, which implies that $VAR[L_n] = \Theta(n)$ holds
already for $l = 6$. Details on the solution of this minimization problem can be found in [28].

The article is organized as follows: in what is left of Section 2, we explain how to relate
the effect of the random modification to a constrained optimization problem on the
proportions of symbols used to build up the LCS, and how this relation is used to prove Theorem
2.1. In Section 3 we discuss some combinatorial aspects of aligned blocks in optimal
alignments specific to this block model. Finally, in Section 4 we prove Theorem 2.3.
2.1 Random modification and proportion of aligned blocks

Let us next look, with the help of an example, at when the random modification introduces an
increase or a decrease in the score:

Example 2.2 Let us suppose $l = 3$. Let us take two sequences x = 00110011110000111 and
y = 0011100001100001111. An optimal alignment (in the sense introduced in Section 1.1) would
be:

(2.2)

In this example no block gets left out completely. By this we mean that no block is only aligned
with gaps. The first block of x is aligned with the first block of y. The second block of x is
aligned with the second block of y. By this we mean that all the bits from the second block of
x are either aligned with bits of the second block of y or with gaps, and vice versa. The
second block of the LCS is hence obtained from the second blocks of x and y by taking
the minimum of their respective lengths. In our current special example, we have that for all
i = 1, 2, ..., 6, the i-th block of x gets aligned with the i-th block of y. We could represent
this idea visually by viewing the alignment as an alignment of blocks in the following manner:
Let us next analyze the expected change when we perform our random modification. In
x there are exactly 3 blocks of length $l - 1 = 2$. These are the first three blocks of x. The first
block of x of length 2 is aligned with a block of y of length 2, the second one with a block of
length 3 and the third with a block of length 4. Hence, when we increase the length of the first
block of length 2 of x by one, the score does not increase. When we increase the second or third,
however, the score increases by one unit. Each of these blocks has the same probability 1/3
of being drawn. Hence, the conditional expected increase due to the enlargement of a randomly
chosen block of length 2 is, in this case, equal to 2/3. In our random modification we also
choose a block of length $l + 1$ and decrease it to length $l$. In our example, there are two blocks
in x of length $l + 1 = 4$. These blocks are the fourth and fifth block of x. The fourth block is
aligned with a block of length 2 whilst the fifth is aligned with a block of length 4. Hence, when
we decrease the length of the fourth block we get no change in score, whilst when we decrease
the fifth we get a decrease by one unit. Each of the two blocks has the same probability of being
drawn. This implies that the expected change due to decreasing a randomly chosen block of
length 4 is equal to $-1/2$. Adding the two changes, we find that for x and y defined as in the
current example, the conditional expected change is equal to:

$$E[\tilde{L}_n - L_n \mid X = x, Y = y] = \frac{2}{3} - \frac{1}{2} = \frac{1}{6}. \tag{2.4}$$

In our example we have six aligned block pairs leading to the following set of pairs of lengths:

{(2, 2); (2, 3); (2, 4); (4, 2); (4, 4); (3, 4)}.
Let $p_{ij}$ designate the proportion of aligned block pairs in which the x-block has length
i and the y-block has length j.

Example 2.3 For our example above we have:

$$\begin{pmatrix} p_{22} & p_{23} & p_{24} \\ p_{32} & p_{33} & p_{34} \\ p_{42} & p_{43} & p_{44} \end{pmatrix} = \frac{1}{6} \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 1 \end{pmatrix} \tag{2.5}$$



With this notation, (2.4) can be written in the form:

$$E[\tilde{L}_n - L_n \mid X = x, Y = y] \ge \frac{p_{l-1,l} + p_{l-1,l+1}}{p_{l-1,l-1} + p_{l-1,l} + p_{l-1,l+1}} - \frac{p_{l+1,l+1}}{p_{l+1,l-1} + p_{l+1,l} + p_{l+1,l+1}}. \tag{2.6}$$
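
As a check of the arithmetic, the right-hand side of (2.6) can be evaluated exactly for the six aligned block pairs of Example 2.2. The sketch below is illustrative; the function name `expected_change_no_gaps` is hypothetical. It works with pair counts rather than proportions, since the common normalization cancels in each ratio.

```python
from collections import Counter
from fractions import Fraction


def expected_change_no_gaps(pairs, l):
    """Right-hand side of (2.6) for a list of aligned block-length pairs
    (x-block length, y-block length), with no left-out blocks."""
    counts = Counter(pairs)

    def row(i):
        # Number of aligned pairs whose x-block has length i.
        return sum(counts[(i, j)] for j in (l - 1, l, l + 1))

    # Enlarging a block of length l-1 gains one unit unless its partner
    # also has length l-1.
    gain = Fraction(counts[(l - 1, l)] + counts[(l - 1, l + 1)], row(l - 1))
    # Shrinking a block of length l+1 loses one unit only if its partner
    # has length l+1.
    loss = Fraction(counts[(l + 1, l + 1)], row(l + 1))
    return gain - loss


pairs = [(2, 2), (2, 3), (2, 4), (4, 2), (4, 4), (3, 4)]
print(expected_change_no_gaps(pairs, 3))  # 2/3 - 1/2 = 1/6
```

The value 1/6 agrees with (2.4).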



The inequality (2.6) holds if there exists an optimal alignment a of x and y leaving out no
blocks, and having a proportion $p_{ij}$ of aligned block pairs such that the x-block has length
i and the y-block has length j (for every $i, j \in \{l - 1, l, l + 1\}$). Typically, for large n, the
optimal alignment will not be like in the example above, but there will be blocks which are
left out, which also implies that some blocks are aligned with several blocks at the same time.
Let us check an example:

Example 2.4 Let x = 00110011100011000 and y = 00001111000011000. In this situation 
the LCS is equal to LCS(x,y) = 000011100011000 and corresponds to the following optimal 
alignment: 
which in block representation would be:

In the last alignment above we see that the first block of y is aligned with the first and third
block of x. This implies that the second block of x is "completely left out", which means all its
bits are aligned with gaps. The other blocks are aligned one block with one block: the fourth
block of x is aligned with the second block of y, whilst the fifth block of x is aligned with the
third block of y. Finally the last two blocks of x are aligned with the last two blocks of y.

In everything that follows, the proportions $p_{ij}$ will only refer to the block pairs aligned one
block with one block. Hence, in the alignment (2.7) the first three blocks of x and the first
block of y do not contribute to $\{p_{ij}\}_{i,j}$.

Example 2.5 In the last example above there are 4 block pairs aligned one block with one
block. The corresponding pairs of block lengths are:

(3, 4); (3, 4); (2, 2); (3, 3)
Hence for the alignment (2.7), we find $p_{3,4} = 2/4$, $p_{2,2} = 1/4$, $p_{3,3} = 1/4$ and $p_{ij} = 0$ for all
$(i, j) \notin \{(3, 4), (2, 2), (3, 3)\}$. We will denote by $q_1$, resp. $q_2$, the proportion of left out blocks
in x, resp. in y. In the alignment (2.7), in the sequence x there is one left out block from a
total of 7 blocks. This implies that $q_1 = 1/7$. There is no left out block in y so that $q_2 = 0$.
Later we will see that typically, for n large enough, $q_1$ and $q_2$ can be taken as close
to each other as we want. When $q_1 = q_2$ we denote the proportion of left out blocks by q.
When we choose a block of length $l - 1$ in x to increase its length we will have to consider the
probability that the block is not aligned one block with one block. In the alignment (2.7) there
are 4 blocks in x of length $l - 1 = 2$. The first three are not aligned one block with one block:
the second is left out, whilst the first and the third block are aligned with the same block of y.
Hence in (2.7) the proportion of blocks not aligned one to one among the blocks of length 2 is
3/4. On the other hand, there is no block of length $l + 1 = 4$ in x that fails to be aligned one
to one. So, for the alignment (2.7), the proportion of blocks not aligned one to one among the
blocks of length 4 is 0.

Using some combinatorial arguments, in Section 3 we will see that typically the proportion
among the blocks of x of length $l - 1$ which are not aligned one block with one block is
not more than 9q. Similarly, for the blocks of length $l + 1$ in x one gets a bound 3q for
the proportion of blocks aligned with several blocks of y or left out. We can rewrite the
lower bound on the right side of inequality (2.6), taking also into account the left out blocks.
Assuming that there is an equal proportion q of blocks which are not aligned one to one in x
and in y, we get the following lower bound for the conditional expected increase in the LCS:

$$\frac{p_{l-1,l} + p_{l-1,l+1}}{p_{l-1,l-1} + p_{l-1,l} + p_{l-1,l+1}} \cdot (1 - 9q) - \frac{p_{l+1,l+1}}{p_{l+1,l-1} + p_{l+1,l} + p_{l+1,l+1}} \cdot (1 - 3q) - 3q. \tag{2.9}$$

The above lower bound for the conditional expected increase in the LCS holds assuming that the
following conditions hold:

• There exists an optimal alignment leaving out exactly the same proportion q of blocks in
X and in Y. For that optimal alignment a, let $\{p_{ij}\}_{i,j}$ denote the empirical distribution
of the aligned block pairs, so that $p_{ij} = P_{ij}(a)$.

• There is exactly the same number of blocks in X and in Y.

• In X, each block length $l - 1$, $l$, $l + 1$ constitutes exactly 1/3 of the blocks. The same
holds in Y.

The above conditions do not typically hold exactly but only approximately. We first look
at this somewhat simplified case before looking at the general case (for the general case, see
the proof of theorem 2.1.3). Let us next explain how we get the bound (2.9) for this
simplified case (the reader should also compare it to the version (2.6) with no gaps). Assume
next that we have an optimal alignment a with given empirical distribution $\{p_{ij}\}_{i,j}$
of the aligned block pairs and leaving out in both sequences x and y a proportion q of blocks.
What is now the effect of our random change on the score of the alignment a? First let us
look at the randomly chosen block of length $l - 1$ which gets its length changed to $l$. If that
block is aligned with a block of length $l$ or $l + 1$, the alignment score gets increased by one unit. So,
conditional on the randomly chosen block of length $l - 1$ being aligned one to one, we
get that the probability of an increase is at least:

$$\frac{p_{l-1,l} + p_{l-1,l+1}}{p_{l-1,l-1} + p_{l-1,l} + p_{l-1,l+1}}.$$



Now, if the randomly chosen block of length $l - 1$ is aligned with two or more blocks, then
we also get an increase by one unit. If the chosen block however is aligned with a block of
Y which is aligned with several blocks of X (let us call it a polygamist block), then we have
no increase. The same happens if the block is not aligned with any block of Y. There is at
most a proportion 3q of blocks which are not aligned with any block or are aligned together with
a polygamist block of Y. The blocks of length $l - 1$ make up a proportion of about 1/3 of all blocks. Hence
among the blocks of length $l - 1$, there is a proportion of at least $1 - 9q$ which are aligned one
block with one block or aligned one with several. Hence we get that the conditional expected
change due to changing the randomly chosen block of length $l - 1$ to $l$ is at least:

$$\frac{p_{l-1,l} + p_{l-1,l+1}}{p_{l-1,l-1} + p_{l-1,l} + p_{l-1,l+1}} \cdot (1 - 9q). \tag{2.10}$$

Similarly we can analyze the effect of the randomly chosen block of length $l + 1$ which gets
reduced to length $l$. If the block is aligned one block to one block and the length of the
aligned block of Y is $l + 1$, then the score can get reduced by one. If the block is aligned with
a block of Y of length $l$ or $l - 1$, the score does not get reduced. Hence, given that the block
of length $l + 1$ chosen is aligned one block to one block, the conditional expected change is
not less than:

$$-\frac{p_{l+1,l+1}}{p_{l+1,l-1} + p_{l+1,l} + p_{l+1,l+1}}.$$

On the other hand, when the chosen block of length $l + 1$ is aligned with several blocks of Y,
then the score can go down by one unit. There is at most a proportion q of blocks of X aligned
with several blocks of Y. So, among the blocks of length $l + 1$ this represents a proportion of
at most 3q. Hence we get that at worst the expected change due to changing a random block
from $l + 1$ to $l$ is equal to:

$$-\frac{p_{l+1,l+1}}{p_{l+1,l-1} + p_{l+1,l} + p_{l+1,l+1}} \cdot (1 - 3q) - 3q. \tag{2.11}$$



Putting (2.10) and (2.11) together we get that the expected conditional change of the alignment
score is bounded below as follows:

$$E[\Delta L_a \mid X, Y] \ge \frac{p_{l-1,l} + p_{l-1,l+1}}{p_{l-1,l-1} + p_{l-1,l} + p_{l-1,l+1}} \cdot (1 - 9q) - \frac{p_{l+1,l+1}}{p_{l+1,l-1} + p_{l+1,l} + p_{l+1,l+1}} \cdot (1 - 3q) - 3q,$$

where $\Delta L_a$ denotes the change in score of the alignment a due to the random modification
of X.
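
To get a feeling for how the left-out proportion q erodes this lower bound, one can evaluate it for a concrete distribution. The sketch below is only an illustration: the uniform distribution $p_{ij} = 1/9$ is chosen here as a convenient test case (it is not claimed to be the distribution of an optimal alignment), and the function name `lower_bound` is hypothetical.

```python
from fractions import Fraction


def lower_bound(p, l, q):
    """Evaluate the lower bound (2.9) for a distribution p of aligned
    block pairs (dict keyed by (i, j)) and a left-out proportion q."""
    lengths = (l - 1, l, l + 1)
    row_short = sum(p[(l - 1, j)] for j in lengths)
    row_long = sum(p[(l + 1, j)] for j in lengths)
    # Term (2.10): expected gain from enlarging a block of length l-1.
    gain = (p[(l - 1, l)] + p[(l - 1, l + 1)]) / row_short * (1 - 9 * q)
    # Term (2.11): worst-case expected loss from shrinking a block of
    # length l+1, including the extra -3q.
    loss = p[(l + 1, l + 1)] / row_long * (1 - 3 * q) + 3 * q
    return gain - loss


l = 3
uniform = {(i, j): Fraction(1, 9) for i in (2, 3, 4) for j in (2, 3, 4)}
for q in (Fraction(0), Fraction(1, 100), Fraction(1, 24)):
    print(q, lower_bound(uniform, l, q))
```

For the uniform distribution the bound simplifies to $1/3 - 8q$, so in this test case it stays positive exactly when $q < 1/24$.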



Then, to prove inequality (2.1) in Theorem 2.2, it is sufficient to show that for all optimal
alignments a of X and Y, expression (2.9) is positive and bounded away from zero with high
probability. Hence the next question is: how can we prove that typically, for large n, expression
(2.9) is larger than a positive constant not depending on n?
Example 2.6 Let us return to the example of alignment (2.7). That alignment left out
only one block, and that was the second block of X. We could now proceed in a different
order. We could first decide which blocks get left out before generating the random sequences
X and Y. The resulting alignment is in general not optimal. On the other hand, such an
alignment has the property that the block pairs aligned one to one are i.i.d. This is a very
nice property for large deviation estimates, for instance. Let us give an example. Assume
we request that the only left out block is the second block of X (as in alignment (2.7)). Assume
we redraw X and Y and obtain X = 00111001100011000 and Y = 00011110000111000. Then
we get as alignment and Common Subsequence (CS) the following:



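X  | 0 0 1 1 1 0 0 1 1 - - 0 0 0 - 1 1 - 0 0 0
Y  | 0 0 - - - 0 - 1 1 1 1 0 0 0 0 1 1 1 0 0 0
CS | 0 0       0   1 1     0 0 0   1 1   0 0 0
                                                  (2.12)

which can be represented as an alignment of blocks by:

X  | 0011100 | 11   | 000  | 11  | 000
Y  | 000     | 1111 | 0000 | 111 | 000
CS | 000     | 11   | 000  | 11  | 000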



In this case we use the term common subsequence instead of longest common
subsequence because we are leaving a block out of the alignment; if we did not leave it out we might
get a longer common subsequence (which does not happen in this case either, but might
happen in the general case). So, in this last example, before drawing X and Y, we know
that the fourth block of X gets aligned with the second block of Y and this aligned pair builds
the second block in the CS. The second block of the CS thus has length equal
to $\min\{B_{X4}, B_{Y2}\}$. Similarly, before even drawing X and Y, we know that the fifth block
of X gets aligned with the third block of Y. Hence, the pair of lengths in the
second block pair is $(B_{X5}, B_{Y3})$ whilst the third block of the CS has length $\min\{B_{X5}, B_{Y3}\}$.
Note that $(B_{X4}, B_{Y2})$ is independent of $(B_{X5}, B_{Y3})$, and $B_{X4}$ is independent of $B_{Y2}$ whilst
$B_{X5}$ is independent of $B_{Y3}$. The distribution of each of the blocks $B_{X4}$, $B_{Y2}$, $B_{X5}$ and $B_{Y3}$
is unchanged: they take the values $l - 1$, $l$ or $l + 1$ with equal probability 1/3. Hence, $(B_{X4}, B_{Y2})$
can take any of the nine values in the set $\{(i, j) \mid i, j \in \{l - 1, l, l + 1\}\}$ with probability 1/9.
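
Since a one-to-one aligned pair contributes $\min\{B_X, B_Y\}$ letters to the common subsequence and takes each of the nine values with probability 1/9, the expected contribution of such a pair can be computed by direct enumeration. The following sketch is an illustrative side calculation, not a statement from the paper; the function name `expected_min_block` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_min_block(l):
    """E[min(B_X, B_Y)] when B_X and B_Y are independent and uniform on
    {l-1, l, l+1}, computed by enumerating the nine equally likely pairs."""
    lengths = (l - 1, l, l + 1)
    total = sum(min(i, j) for i, j in product(lengths, repeat=2))
    return Fraction(total, 9)


print(expected_min_block(3))  # 23/9, i.e. l - 4/9 for l = 3
```

Five of the nine pairs have minimum $l - 1$, three have minimum $l$ and one has minimum $l + 1$, which gives $l - 4/9$ in general.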



When we specify an alignment by deciding which blocks we leave out before drawing X and
Y, the aligned block pairs are "almost" i.i.d. Why do we say "almost"? In the above example
$(B_{X4}, B_{Y2})$ and $(B_{X5}, B_{Y3})$ are i.i.d. and not just close to being i.i.d. On the other hand, block
$B_{X7}$ in the case (2.7) is no longer in X if the first, third and fourth blocks each get increased by
one unit. In this sense the blocks are not completely independent. But since we take n large,
this is only a minor effect. We will take care of this detail in Section 4 and until then pretend
that the aligned block pairs are i.i.d.



Note that for each alignment a defined by specifying which blocks we leave out before drawing X
and Y, the empirical distribution of the aligned blocks is random. We write $\{P_{ij}(a)\}_{i,j \in \{l-1, l, l+1\}}$
for this empirical distribution. Thus, $P_{ij}(a)$ denotes the proportion of aligned block pairs
where the block of X has length i and the block of Y has length j. Given a non-random
distribution $\{p_{ij}\}_{i,j \in \{l-1, l, l+1\}}$, we can ask what is the probability for the empirical
distribution to be equal to $\{p_{ij}\}_{i,j}$. The answer is that, since the block pairs are close to i.i.d.,
the probability is close to a multinomial probability:



$$\binom{n^*}{p_{l-1,l-1}n^* \;\; p_{l-1,l}n^* \;\; \cdots \;\; p_{l+1,l}n^* \;\; p_{l+1,l+1}n^*} \left(\frac{1}{9}\right)^{n^*} \qquad (2.13)$$

where $n^*$ designates the total number of aligned block pairs (here we act as if that number
were non-random). By using Stirling's formula, the expression 2.13 is approximately equal to:

$$e^{(\ln(1/9)+H(p))n^*} \qquad (2.14)$$

where H(p) designates the entropy of the empirical distribution:

$$H(p) = \sum_{i,j \in \{l-1,l,l+1\}} p_{ij} \ln(1/p_{ij}).$$

A question arises: for a given distribution $\{p_{ij}\}$ of aligned block pairs, is it likely that there
exists an alignment with that distribution and having a proportion q of left out blocks? Let
$\mathcal{A}(q)$ denote the set of alignments leaving out a proportion q of blocks. Let A denote the
event that there exists an alignment in $\mathcal{A}(q)$ having its empirical distribution equal to $\{p_{ij}\}$.
An upper bound for the probability P(A) is given by the number of elements in $\mathcal{A}(q)$ times
the probability 2.13. By using 2.14 this product is close to:

$$|\mathcal{A}(q)| \cdot e^{(\ln(1/9)+H(p))n^*} \qquad (2.15)$$

But the size of the set $\mathcal{A}(q)$ is approximately equal to $e^{2H(q)n/l}$, since there are about n/l
blocks. Hence, expression 2.15 is approximately equal to:

$$e^{2H(q)n/l + (\ln(1/9)+H(p))n^*} \qquad (2.16)$$

If we want the event A not to have exponentially small probability in n, we need the logarithm
of 2.16 to be non-negative, which leads to the condition:

$$2H(q) + (1-4q)(\ln(1/9)+H(p)) \ge 0, \qquad (2.17)$$

where we used as lower bound on $n^*$ the number $(n/l)(1-4q)$.
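As a numerical illustration, the left-hand side of condition 2.17 can be evaluated for a candidate pair (q, p). The following Python sketch is not part of the original argument. It assumes that H(q) denotes the entropy of the two-point distribution (q, 1 - q), which is our reading of the notation, and all function names are ours.

```python
import math

def entropy(probs):
    """Entropy sum p*ln(1/p) with natural logarithm; terms with p = 0 contribute 0."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

def binary_entropy(q):
    """Assumed meaning of H(q): entropy of the two-point distribution (q, 1 - q)."""
    return entropy([q, 1.0 - q])

def condition_2_17(q, p):
    """Left-hand side of 2.17: 2H(q) + (1 - 4q)(ln(1/9) + H(p)).

    q is the proportion of left out blocks, p a list of the nine values p_ij.
    """
    return 2.0 * binary_entropy(q) + (1.0 - 4.0 * q) * (math.log(1.0 / 9.0) + entropy(p))

uniform = [1.0 / 9.0] * 9          # equiprobable block pair distribution
skewed = [0.6] + [0.05] * 8        # hypothetical, strongly non-uniform distribution

print(condition_2_17(0.0, uniform))    # numerically zero: ln(1/9) + H(p) vanishes
print(condition_2_17(0.01, skewed))    # negative: ruled out when few blocks are left out
print(condition_2_17(0.1, uniform))    # positive: the uniform distribution is admissible
```

With q small, only distributions p close to the equiprobable one keep the expression non-negative, which is the mechanism used in the rest of the section.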



We can now explain how we prove that typically, for every optimal alignment, expression 2.9
is larger than a positive constant not depending on n. For this we simply need to find a $q_0$
such that we can prove that the optimal alignment leaves out at most a proportion $q \le q_0$ of
blocks, and then show that expression 2.9 is bounded away from zero under condition 2.17 for
$q \in [0, q_0]$.



Let $F^n(q)$ be the event that any optimal alignment of X and Y leaves out at most a proportion
q of blocks in X and at most the same proportion q of blocks in Y. In more detail,
given q > 0 and an optimal alignment a of X and Y in $F^n(q)$, we can count the number of
blocks of X that are left out (not used in a) and divide this number by the total number of blocks
in X to obtain $q_1$, and likewise count the left out blocks of Y and divide by the total number of
blocks in Y to obtain $q_2$; then we know that $q_1 \le q$ and $q_2 \le q$.

Example 2.7 Let us take again the case where X = 00111001100011000 and
Y = 00011110000111000. Then we have as before the following common subsequence (CS)
represented in an alignment:

    X  | 0 0 1 1 1 0 0 1 1 - - 0 0 0 - 1 1 - 0 0 0
    Y  | 0 0 - - - 0 - 1 1 1 1 0 0 0 0 1 1 1 0 0 0
    CS | 0 0       0   1 1     0 0 0   1 1   0 0 0

and represented as an alignment of blocks by:

    X  | 0011100 | 11   | 000  | 11  | 000
    Y  | 000     | 1111 | 0000 | 111 | 000
    CS | 000     | 11   | 000  | 11  | 000



Let us compute $q_1$ and $q_2$ in this case. For X we have a total of 7 blocks and only 1 block is
left out in the alignment, so $q_1 = 1/7$. For Y we have no left out blocks, so $q_2 = 0$. Then,
given q > 0, this alignment belongs to $F^n(q)$ if and only if $q_1 = 1/7 \le q$ and $q_2 = 0 \le q$.
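The block counts in this example are easy to check mechanically. The short Python sketch below is only an illustration; the helper names `blocks` and `left_out_proportion` are ours, and block indices are 0-based, so the second block of X has index 1.

```python
from fractions import Fraction
from itertools import groupby

def blocks(s):
    """Split a string into its blocks, i.e. maximal runs of one symbol."""
    return ["".join(run) for _, run in groupby(s)]

def left_out_proportion(s, left_out):
    """Proportion of blocks of s whose 0-based index belongs to left_out."""
    return Fraction(len(set(left_out)), len(blocks(s)))

x = "00111001100011000"
y = "00011110000111000"

print(blocks(x))   # seven blocks
print(blocks(y))   # five blocks

q1 = left_out_proportion(x, {1})      # the block 111 of X is left out
q2 = left_out_proportion(y, set())    # no block of Y is left out
print(q1, q2)                         # 1/7 0
```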



The next theorem says that if we can bound expression 2.9 away from zero under condition
2.17, then we typically have the desired bias for $E[\tilde{L}_n - L_n \mid X, Y]$, the conditional expected
increase in score:



Theorem 2.3 Assume that there exists $q_0 \in [0, 1/3[$ such that the following minimizing
problem:

$$\min \left[ \frac{p_{l-1,l} + p_{l-1,l+1}}{p_{l-1,l-1} + p_{l-1,l} + p_{l-1,l+1}} (1 - 9q) - \frac{p_{l+1,l+1}}{p_{l+1,l-1} + p_{l+1,l} + p_{l+1,l+1}} (1 - 3q) - 3q \right] \qquad (2.18)$$

under the conditions:

$$q \in [0, q_0], \qquad (2.19)$$

$$\sum_j p_{l-1,j} \ge \frac{(1/3) - q_0}{2}, \quad \sum_j p_{l+1,j} \ge \frac{(1/3) - q_0}{2}, \qquad (2.20)$$

$$2H(q) + (1 - 4q)(\ln(1/9) + H(p)) \ge 0 \qquad (2.21)$$

has a strictly positive solution. Let this minimum be equal to $2\epsilon > 0$. Then we have that:



$$P\left( E[\tilde{L}_n - L_n \mid X, Y] \ge \epsilon \right) \ge 1 - e^{-\beta n} - P(F^{nc}(q_0)) \qquad (2.22)$$

where $\beta > 0$ is a constant not depending on n.



Note that the high probability of the biased effect is only obtained when $P(F^{nc}(q_0))$ is small
(recall that $F^n(q_0)$ is the event that in any optimal alignment the proportion of left out blocks
is at most $q_0$). This means that, in order to apply the above theorem, we first need to
come up with a way to bound the proportion q of left out blocks in any optimal alignment. If
the proportion of left out blocks q is too high, the joint distribution $P_{ij}(a)$ of the aligned block
lengths could be just about anything. In other words, the entropy condition 2.21 becomes useless
when $q_0$ is not small enough. We can now summarize how to apply the last theorem above:
we first need to establish that the proportion of left out blocks is small enough. This means
that we need to find a $q_0$ such that $P(F^n(q_0))$ is close to 1 and such that $q_0$ is small enough
for the objective function 2.18 to be bounded away from zero under the constraints 2.19, 2.20 and
2.21. In section 3.1, we show that the proportion of left out blocks typically does not exceed
any $q_0$ for which $q_0 > (4/9)/(l-1)$, where l is the average block length. With this bound
on q (that is, taking $q_0 = (4/9)/(l-1)$) we are then able to verify numerically that the
objective function 2.18 is bounded away from zero under our constraints, already for l = 6. This
then implies that for any $l \ge 6$, the order of the fluctuation is $VAR[L_n] = \Theta(n)$. In the next
section, we prove the last theorem precisely, taking care of other details, for example that the
proportion of left out blocks in X and in Y does not coincide in every alignment, only in the
optimal alignment. Another important point is that the probability $P(F^{nc}(q))$ depends on the
parameter l. In section 3.1, we show how to find upper bounds on the proportion of left out
blocks. In general, the larger l is, the better the bounds get. Actually the bounds even converge to
zero as l goes to infinity. As q goes to zero, expression 2.18 gets close to 1/3 on the domain
of the minimizing problem. That is why the minimizing problem in theorem 2.3 has a strictly
positive solution when l is large enough.

Let us now prove that theorem 2.3 and theorem 2.2 together imply theorem 2.1.



Proof. We suppose that $F^{nc}(q_0)$ has exponentially small probability for any fixed $q_0 > 0$
provided l is large enough (see sections 3.1 and 4.2). In section 4 we will show how large l
should be depending on $q_0$ but not on n. The conditions in theorem 2.3 are satisfied when
$q_0 > 0$ (and hence $q \le q_0$) is taken small enough. Let us explain why. First note
that inequality 2.21 can be written:

$$H(p) \ge \ln 9 - \frac{2H(q)}{1-4q} \qquad (2.23)$$

When q goes to zero, H(q) also goes to zero and so does $2H(q)/(1-4q)$. But
H(p) is always less than or equal to $\ln 9 = -\ln(1/9)$, with equality iff all the $p_{ij}$'s are equal to 1/9.
It follows that by taking $q_0 > 0$ small enough, condition 2.23 implies that the
distribution $p_{ij}$ gets as close as we want to the equiprobable distribution. On the other hand,
when q goes to zero and all the $p_{ij}$'s converge to 1/9, then the quantity

$$\frac{p_{l-1,l} + p_{l-1,l+1}}{p_{l-1,l-1} + p_{l-1,l} + p_{l-1,l+1}} (1 - 9q) - \frac{p_{l+1,l+1}}{p_{l+1,l-1} + p_{l+1,l} + p_{l+1,l+1}} (1 - 3q) - 3q$$

converges to 2/3 - 1/3 = 1/3 > 0. This shows that by taking $q_0 > 0$ small enough,
the minimizing problem in theorem 2.3 has a strictly positive solution. So, assume that
$q_0 > 0$ is such that the following two things hold:

• $F^{nc}(q_0)$ has exponentially small probability in n.

• The minimizing problem in theorem 2.3 has a strictly positive solution. Call this
solution $2\epsilon$, where $\epsilon > 0$.

By theorem 2.3, we then have that inequality 2.22 holds. But since $P(F^{nc}(q_0))$ is exponentially
small in n, the expression on the right hand side of inequality 2.22 differs from 1 by less than
$\exp(-n^{\alpha})$ for all n, where $\alpha > 0$ does not depend on n. This implies that condition 2.1 in theorem
2.2 is satisfied. Then theorem 2.2 implies that:

$$VAR[L_n] = \Theta(n).$$



3 Combinatorics of the left out blocks 



In this section we describe some of the combinatorial properties that our block model exhibits 
when looking for optimal alignments. Let us begin with an example: 



Example 3.1 Let X = 0011100 and Y = 0001100. The LCS is 001100. This corresponds to
the following alignment:

    X   | 0 0 - 1 1 1 0 0
    Y   | 0 0 0 1 1 - 0 0
    LCS | 0 0   1 1   0 0          (3.1)



In this example, the first block of the LCS has length 2. It is obtained from the first block
of X and the first block of Y. The first block of X has length 2 whilst the first block of Y
has length 3. The length of the first block of the LCS is equal to the minimum of these two
numbers. In this kind of situation we say that the first block of X is aligned to the first block
of Y. Similarly the length of the second block of the LCS is the minimum of the lengths of
the second block of X and of Y. We say that in this alignment the second block of X gets
aligned with the second block of Y. Finally the third block of X gets aligned with the third
block of Y to yield the third block of the LCS. In the present example no block of X or Y got
left out completely: every block "contributed" some bits to the LCS. All the blocks are aligned
one block of X with one block of Y. Each such pair of aligned blocks is responsible for one
block in the LCS.
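The length of the LCS in small examples such as this one can be checked with the classical dynamic programming recursion. The following Python sketch is a standard textbook implementation given for illustration only; it is not a method introduced in this paper, and the function name is ours.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of the strings x and y.

    Row-by-row dynamic programming: after processing a prefix of x,
    prev[j] is the LCS length of that prefix and the first j symbols of y.
    """
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))   # skip a symbol of x or of y
        prev = cur
    return prev[-1]

print(lcs_length("0011100", "0001100"))   # 6, the length of 001100
```

The running time is proportional to the product of the two lengths, which is sufficient for the short sequences used in the examples.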

In some other cases, some blocks of X and Y are completely left out. Let us look at such a 
situation. 



Example 3.2 Consider X = 00100000111 and Y = 00000100011. The LCS is
000000011 and corresponds to the alignment:

    X   | 0 0 1 0 0 0 - 0 0 - 1 1 1
    Y   | 0 0 - 0 0 0 1 0 0 0 1 1 -
    LCS | 0 0   0 0 0   0 0   1 1          (3.2)



In the last example above, the second block of X and the second block of Y are totally left out
and do not contribute to the LCS. We say that these blocks are left out blocks. The last
block of X and the last block of Y "get aligned" together to yield the last block of the LCS.
We say that this is an aligned block pair, or also that these two blocks are aligned one block
to one block. One way of thinking about the LCS defined by the alignment 3.2 above is as
follows: we first decide which blocks we leave out in X and Y. Then, from the two sequences
obtained, we align block by block without leaving out any blocks. So the alignment 3.2 can
be seen as the alignment in which we leave out the second block of X and the second block
of Y. This then gives the modified sequences X* = 0000000111 and Y* = 0000000011.
Then we align X* and Y* block by block. The common subsequence we obtain has its i-th
block of length equal to the minimum of the lengths of block i of X* and of Y*. In this
example, the first and the third block of X get aligned with the first and third
block of Y. By this we mean that in both sequences the first and third block are merged into
one block and these blocks are then matched. We will be able to prove that in the case we
study here this is untypical: for optimal alignments we will only have one block aligned with
several at the same time, but not several with several. Let us look at one more example:



Example 3.3 Let X = 001001111 and Y = 000011011. The LCS is 00001111. This
corresponds to the alignment:

    X   | 0 0 1 0 0 1 1 - 1 1
    Y   | 0 0 - 0 0 1 1 0 1 1
    LCS | 0 0   0 0 1 1   1 1          (3.3)



Here the second block of X is left out. Hence the first and the third block of X get aligned
with the first block of Y. Similarly the fourth block of X gets aligned with the second and
fourth block of Y. The third block of Y is left out.

This situation will happen in optimal alignments: one block aligned with several blocks of the
other sequence.



Assume that we know for an alignment a which blocks are left out. Let X*, resp.
Y*, denote the modified sequence X, resp. Y, where we left out the specified blocks. Let Z
denote a common subsequence defined by the alignment a. The alignment must then align
all the blocks of X* with the blocks of Y* one to one, otherwise there would be more left out
blocks. Hence, the first block of X* gets aligned, then the second block of X* and so on.
If the alignment is to stand a chance of being optimal (and hence Z of being an LCS),
then for each pair of blocks from X* and Y* aligned to one another it needs to extract a
maximum of bits from each such pair. Hence, for every i = 1, 2, . . . , j the length
of block number i of Z must be equal to the minimum of the length of the i-th
block of X* and the length of the i-th block of Y* (here j denotes the number of blocks in
Z). Hence, since we are interested in LCSs (and hence in optimal alignments), we will only
consider alignments defined in the following manner: first we define exactly which blocks get
left out. Second we align the resulting sequences X* and Y* one block with one block. The
next lemma says that in our setting an optimal alignment cannot align several blocks with
several blocks.
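The two-step description above (first remove the left out blocks, then align X* and Y* block by block) translates directly into a short computation. The Python sketch below is an illustration under our own conventions: the function names are ours, blocks are indexed from 0, and the function returns None when X* and Y* cannot be aligned one block to one block.

```python
from itertools import groupby

def run_lengths(s):
    """Blocks of s as (symbol, length) pairs."""
    return [(sym, len(list(run))) for sym, run in groupby(s)]

def drop_blocks(s, left_out):
    """Remove the blocks with the given 0-based indices; neighbouring blocks then merge."""
    kept = [sym * n for i, (sym, n) in enumerate(run_lengths(s)) if i not in left_out]
    return "".join(kept)

def block_alignment_score(x, y, left_out_x, left_out_y):
    """Sum over i of min(length of block i of X*, length of block i of Y*).

    Returns None if X* and Y* do not have the same sequence of block symbols.
    """
    bx = run_lengths(drop_blocks(x, left_out_x))
    by = run_lengths(drop_blocks(y, left_out_y))
    if len(bx) != len(by) or any(a[0] != b[0] for a, b in zip(bx, by)):
        return None
    return sum(min(a[1], b[1]) for a, b in zip(bx, by))

# Example 3.2: the second block of X and the second block of Y are left out.
print(block_alignment_score("00100000111", "00000100011", {1}, {1}))   # 9
# Example 3.3: the second block of X and the third block of Y are left out.
print(block_alignment_score("001001111", "000011011", {1}, {2}))       # 8
```

In both examples the score equals the length of the LCS given in the text, so the chosen sets of left out blocks define optimal alignments.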

Another useful fact is that for optimal alignments we do not need to consider adjacent left out
blocks, except maybe at the end of the sequences. But in section 4 we prove that only a small
percentage of bits could be left out at the end of X and Y in an optimal alignment. Hence,
the practical implication is that we only need to consider left out blocks separated by at least
one non-left out block. Let us first explain what we mean by adjacent left out blocks between
aligned blocks:

Example 3.4 Take x = 11001100 and y = 00001100. Let us align as follows:

    x | - - - - 1 1 0 0 1 1 0 0
    y | 0 0 0 0 1 1 - - - - 0 0          (3.4)



We see a typical situation where the second and third block of x in the alignment above get left
out (i.e. entirely aligned with gaps). These two blocks are adjacent and they lie
between aligned blocks (i.e. in our example they lie between the first and fourth
block of x, which are "aligned"; by aligned we mean aligned with another block and hence not
entirely aligned with gaps). The next lemma below states that for our LCS problem (i.e. for
optimal alignments) the kind of situation we face in the current numerical example 3.4 can
be discarded. The reason is as follows. In the current example $y_6$ gets aligned with $x_2$. Now
instead align $y_6$ with $x_5$ and keep the rest of alignment 3.4 otherwise identical. By
doing this you have not decreased the score but have destroyed the situation of two adjacent
completely left out blocks. The next lemma shows in a rigorous way what we explained in our
example:

Lemma 3.1 There exists an optimal alignment of X and Y having no adjacent left-out blocks 
between aligned blocks. 



Proof. View an alignment as a finite sequence of points in $N \times N$, so that if $x_i$ gets aligned with
$y_j$, then (i, j) is a point in the set representing the alignment. Introduce for two alignments
$a, b \subset N \times N$ the order relation $a \le b$ iff a contains the same number of points as b and, if we
enumerate the points of both sets from bottom left to top right, the i-th point $a_i = (a_{ix}, a_{iy})$
of a and the i-th point $b_i = (b_{ix}, b_{iy})$ of b satisfy $a_{ix} \le b_{ix}$ and $a_{iy} \le b_{iy}$ for all $i \le |a|$. Here
|a| designates the number of points in a. Take now an optimal alignment which is minimal
according to the relation $\le$. That optimal alignment satisfies the property of not having
several adjacent left out blocks between aligned blocks. ■

Next we show the relation between left out blocks at the end of each sequence and the total 
left out blocks in each sequence: 

Lemma 3.2 Let $x, y \in \{0, 1\}^n$ be two sequences of length n. Let the number of blocks of x,
resp. y, be denoted by $n_1^* = (n/l) + \Delta_1$, resp. $n_2^* = (n/l) + \Delta_2$. Assume that $|\Delta_1|, |\Delta_2| \le \Delta$.
Assume also that a is an alignment of x and y which never leaves out adjacent blocks
except maybe a contiguous group at the very end of x and of y. Let $\delta_1 \ge 0$, resp. $\delta_2 \ge 0$, denote the
proportion of blocks which are entirely left out at the end of x, resp. y, among all the blocks
of x, resp. y. Let $q_1$, resp. $q_2$, denote the proportion of blocks left out in x, resp. in y. Then
we find that:

$$|q_1 - q_2| \le 1.5|\delta_1 - \delta_2| + \frac{4l\Delta}{n} \qquad (3.5)$$

Proof. Let x*, resp. y*, denote the sequence we obtain after removing the blocks which
are completely left out by a. Since there are no other completely left out blocks, the number
of blocks in x* must be equal to the number of blocks in y*. Note that for
every left out block which has no adjacent left out block the number of blocks is reduced by
2. For the adjacent left out blocks at the end, for each left out block there is one block less.
Since there are no adjacent left out blocks except the adjacent blocks at the end, we get that
the number of blocks of x*, resp. of y*, is equal to

$$n_1^*(1 - 2(q_1 - \delta_1) + \delta_1), \qquad (3.6)$$

resp.

$$n_2^*(1 - 2(q_2 - \delta_2) + \delta_2). \qquad (3.7)$$

Taking the difference of 3.6 and 3.7, which is zero, and multiplying by l/(2n), we find

$$q_1 - q_2 = 1.5(\delta_1 - \delta_2) + \frac{lb}{n} \qquad (3.8)$$

where

$$2b = \Delta_1(1 - 2(q_1 - \delta_1) + \delta_1) - \Delta_2(1 - 2(q_2 - \delta_2) + \delta_2).$$

We see that |b| is always smaller than $4\Delta$, which ends the proof. ■

Lemma 3.3 For $l \ge 4$ any optimal alignment of X and Y does not align several blocks in
X with several blocks in Y.

Proof. Let us explain the idea through an example. Let us take x = 0001111000111100000
and y = 0001111000001110000, two realizations of X and Y, respectively, with l = 4. An
alignment using all blocks of x and y becomes, in block representation:

    x   | 000 | 1111 | 000   | 1111 | 00000
    y   | 000 | 1111 | 00000 | 111  | 0000
    LCS | 000 | 1111 | 000   | 111  | 0000          (3.9)

Let us now suppose that we leave out the second block of x and the second block of y; then
the alignment in block representation looks like:

    x   | 0001111000   | 1111 | 00000
    y   | 000111100000 | 111  | 0000
    LCS | 000000       | 111  | 0000          (3.10)



One clearly sees that in alignment 3.10 we lost the entire block of 1's of length 4 and did
not gain any new aligned symbol, so the LCS decreased by 4 units compared to alignment 3.9.
In this particular example, the neighbour blocks of the left out block in y had all together at
least as many symbols (8 zeros all together) as the neighbour blocks of the left out block in x
(6 zeros all together). In general we could gain at most 2 new symbols from the neighbour
blocks of the left out block, but we always lose at least l - 1 symbols by leaving a block out and
aligning its neighbours together instead. The other blocks are not involved in the change
of the score. Hence, when one leaves out a block and aligns the neighbour blocks
together, the LCS changes by at most 2 - (l - 1) = 3 - l. Then for blocks of length $l \ge 4$, aligning
several blocks with several blocks decreases the LCS rather than increasing it. ■

3.1 Maximum number of left out blocks

The first key question is what percentage of blocks is at most left out in an optimal
alignment. Since the blocks have length l - 1, l or l + 1 with equal probability 1/3, the expected
block length is l. Hence, the expected number of blocks in a sequence of length n is about
n/l. Now let us define the limit:

$$\gamma_l = \lim_{n \to \infty} \frac{E[L_n]}{n} \qquad (3.11)$$

Hence, the number of bits in the sequence X (and also in the sequence Y) which are not used
for the LCS is about $(1 - \gamma_l)n$. Every block we leave out means at least l - 1 non-used bits.
Hence, the number of left out blocks for long sequences can typically not be much above:

$$\frac{(1 - \gamma_l)n}{l - 1}.$$

This represents typically a proportion of:

$$\frac{(1 - \gamma_l)n/(l - 1)}{n/l} = \frac{1 - \gamma_l}{1 - (1/l)}$$

of the total number of blocks. Hence we find that the proportion of left out blocks in the
optimal alignment is typically close to or below the following bound:

$$\frac{1 - \gamma_l}{1 - (1/l)} \qquad (3.12)$$



Let us next find a simple lower bound for $\gamma_l$ which we can use in expression 3.12. Assume we
choose an alignment which leaves out no blocks. The typical score of such an alignment gives
a lower bound for $\gamma_l$. In this case the common subsequence defined by such an alignment has
its i-th block of length:

$$B_i := \min\{B_{Xi}, B_{Yi}\},$$

where $B_{Xi}$ (resp. $B_{Yi}$) is the length of the i-th block of X (resp. Y). Recall that $B_{Xi}$ (resp.
$B_{Yi}$) has uniform distribution on the set $\{l - 1, l, l + 1\}$. The distribution of the minimum
above is as follows:

$$P(B_i = l - 1) = 5/9, \quad P(B_i = l) = 3/9, \quad P(B_i = l + 1) = 1/9.$$

The expected length is thus:

$$E[B_i] = \frac{5}{9}(l - 1) + \frac{3}{9}l + \frac{1}{9}(l + 1) = l - \frac{4}{9}. \qquad (3.13)$$

Since there are about n/l blocks, aligning all the blocks gives a score of about:

$$\frac{n}{l}E[B_i] = n\left(1 - \frac{4}{9l}\right),$$

so that we obtain:

$$\gamma_l \ge 1 - \frac{4}{9l}.$$

The last inequality together with the bound 3.12 implies that the proportion of left out blocks
should typically not be much above the following bound:

$$\frac{1 - (1 - (4/(9l)))}{1 - (1/l)} = \frac{4/9}{l - 1} \qquad (3.14)$$
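The distribution of the minimum, the expectation 3.13 and the bound 3.14 can be verified by enumerating the nine equally likely pairs of block lengths. The following Python sketch does this with exact rational arithmetic; it is only a check of the computation above, and the function names are ours.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def min_block_distribution(l):
    """Distribution of min(B_X, B_Y) for independent lengths uniform on {l-1, l, l+1}."""
    lengths = (l - 1, l, l + 1)
    counts = Counter(min(i, j) for i, j in product(lengths, repeat=2))
    return {value: Fraction(count, 9) for value, count in counts.items()}

def expected_min_block(l):
    """Expected length of an aligned block, expression 3.13."""
    return sum(value * prob for value, prob in min_block_distribution(l).items())

def left_out_bound(l):
    """Bound 3.14, obtained by inserting the lower bound E[B_i]/l for gamma_l into 3.12."""
    gamma_lower = expected_min_block(l) / l
    return (1 - gamma_lower) / (1 - Fraction(1, l))

l = 6
print(expected_min_block(l))   # 50/9, that is l - 4/9
print(left_out_bound(l))       # 4/45, that is (4/9)/(l - 1)
```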



Another similar approach is to get a lower bound for $\gamma_l$ by simulations. As a matter of fact we
have for any n that $E[L_n]/n$ is a lower bound for $\gamma_l$. By Monte Carlo simulation we can find an
estimate of $E[L_n]/n$ and a very likely lower bound $\gamma_{lb}$. We then replace $\gamma_l$ by $\gamma_{lb}$ in
inequality 3.12.
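A minimal sketch of such a simulation is given below in Python. It is not the code used for the numerical results of this paper, and it makes several assumptions of ours: the first symbol is chosen uniformly at random, the last block is truncated so that the sequence has length exactly n, and the function returns only a point estimate of $E[L_n]/n$. Turning this estimate into a very likely lower bound would require an additional confidence argument, which is not shown here.

```python
import random

def random_block_sequence(n, l, rng):
    """Binary string of length n built from blocks of length l-1, l or l+1.

    Assumptions: alternating symbols, random first symbol, last block truncated.
    """
    pieces = []
    symbol = rng.choice("01")
    total = 0
    while total < n:
        length = min(rng.choice((l - 1, l, l + 1)), n - total)
        pieces.append(symbol * length)
        total += length
        symbol = "1" if symbol == "0" else "0"
    return "".join(pieces)

def lcs_length(x, y):
    """Standard dynamic programming computation of the LCS length."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def estimate_gamma(n, l, trials, seed=0):
    """Monte Carlo point estimate of E[L_n]/n for the block model."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = random_block_sequence(n, l, rng)
        y = random_block_sequence(n, l, rng)
        total += lcs_length(x, y)
    return total / (trials * n)

print(estimate_gamma(n=200, l=6, trials=5))
```

The values of n and of the number of trials are hypothetical and chosen only to keep the running time short; larger values give a sharper estimate.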



4 High probability events 



Let $\delta > 0$ be a parameter not depending on n. We will define a number of events related
to the combinatorial properties of the optimal alignments, called $C^n$, $D^n(\delta)$, $G^n(\delta)$ and
$J^n(\delta)$. In the following we will prove that these events have high probability for n large. By
high probability, we mean a probability which is exponentially close to one in n. It will
turn out that this is true for the above events for any parameter $\delta > 0$ not depending on n.
We will also prove that $F^n(q)$ has high probability for n large in the same sense as above, but
restricted to some values of q. All the missing proofs (omitted for the sake of brevity) can be
found with details in 



A very useful tool we often use is the Azuma-Hoeffding theorem. The following is a version 
of it for martingales (for a proof see [29] ) : 

Theorem 4.1 (Hoeffding's inequality) Let $(V_n, \mathcal{F}_n)$ be a martingale, and suppose that there
exists a sequence $a_1, a_2, \ldots$ of real numbers such that

$$P(|V_n - V_{n-1}| \le a_n) = 1$$

for all n. Then:

$$P(|V_n - V_0| \ge v) \le 2\exp\left\{ -\frac{1}{2} v^2 \Big/ \sum_{i=1}^{n} a_i^2 \right\} \qquad (4.1)$$

for every v > 0.
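To get a feeling for the size of the bound 4.1, it can be evaluated numerically. The Python sketch below is only an illustration; the function name and the numerical values (increments bounded by 4, deviation of order n) are hypothetical choices of ours.

```python
import math

def azuma_hoeffding_bound(v, a):
    """Right-hand side of 4.1: 2 exp(-(1/2) v^2 / sum_i a_i^2).

    v is the deviation and a the list of increment bounds a_1, ..., a_n.
    """
    return 2.0 * math.exp(-0.5 * v * v / sum(ai * ai for ai in a))

# Hypothetical example: n increments bounded by 4 and a deviation v = c * n.
c = 0.1
for n in (1000, 10000, 100000):
    print(n, azuma_hoeffding_bound(c * n, [4.0] * n))
```

With constant increment bounds and a deviation proportional to n, the exponent is proportional to n, so the bound decays exponentially fast in n.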

We will also use a corollary of the above theorem for some intermediate bounds:

Corollary 4.1 Let a > 0 be a constant and $V_1, V_2, \ldots$ be an i.i.d. sequence of bounded random
variables such that:

$$P(|V_i - E[V_i]| \le a) = 1$$

for every i = 1, 2, . . . Then for every $\Delta > 0$, we have that:

$$P\left( \left| \frac{V_1 + \cdots + V_n}{n} - E[V_1] \right| \ge \Delta \right) \le 2\exp\left( -\frac{\Delta^2}{2a^2}\, n \right) \qquad (4.2)$$

4.1 Number of blocks as renewal process

For k > 0 let us define the sum of the lengths of the first k blocks in X as:

$$S_k^X := B_{X1} + \cdots + B_{Xk}.$$

Let us define the number of blocks used in a sequence of length t in X as:

$$N_t^X := \max\{k \ge 0 : S_k^X \le t\} \qquad (4.3)$$

Note that there might be at the end of X a block which has length smaller than l - 1. Since
this is at most one block it plays little role and we will not mention it every time, only when
it is relevant (the same will apply to Y in what follows).

Due to the standard theory of renewal processes, for every k, t > 0 the following relation
holds between the two random variables defined above:

$$N_t^X \ge k \iff S_k^X \le t. \qquad (4.4)$$

In the same way we define for Y the same variables as before:

$$S_k^Y := B_{Y1} + \cdots + B_{Yk}$$

$$N_t^Y := \max\{k \ge 0 : S_k^Y \le t\}$$

where the relation $N_t^Y \ge k \iff S_k^Y \le t$ still holds true for every k, t > 0.
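The duality 4.4 between the partial sums and the block counts can be checked on a small example. The Python sketch below is only an illustration; the list of block lengths is a hypothetical realization with l = 4, and the function names are ours.

```python
from itertools import accumulate

def partial_sums(block_lengths):
    """S_0 = 0, S_1, ..., S_m for a list of m block lengths."""
    return [0] + list(accumulate(block_lengths))

def blocks_up_to(block_lengths, t):
    """N_t = max{k >= 0 : S_k <= t}, restricted to the blocks that are given."""
    sums = partial_sums(block_lengths)
    return max(k for k, s in enumerate(sums) if s <= t)

lengths = [3, 5, 4, 4, 3, 5]          # hypothetical block lengths, l = 4
print(partial_sums(lengths))          # [0, 3, 8, 12, 16, 19, 24]
print(blocks_up_to(lengths, 12))      # 3, since S_3 = 12 <= 12 < S_4 = 16
```

Because the block lengths are positive, the partial sums are increasing, which is exactly why $N_t \ge k$ holds if and only if $S_k \le t$.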

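Let $C^n$ be the event that the number of blocks in X and in Y lies in the interval

$$I_n = \left[ \frac{n}{l} - n^{0.6},\; \frac{n}{l} + n^{0.6} \right].$$

Lemma 4.1 There exists a constant $b_1 > 0$ depending on l such that:

$$P(C^{nc}) \le 8\exp(-b_1 n^{0.2})$$

for every n large enough.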
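Proof. It is easy to see that:

$$C^n = \{N_n^X \in I_n\} \cap \{N_n^Y \in I_n\} \qquad (4.5)$$

It is sufficient to compute directly $P(\{N_n^X \in I_n\}^c)$:

$$P(\{N_n^X \in I_n\}^c) \le P\left( N_n^X < \frac{n}{l} - n^{0.6} \right) + P\left( N_n^X > \frac{n}{l} + n^{0.6} \right) \qquad (4.6)$$

Now let us compute each expression separately. Let $m_1 := \lceil \frac{n}{l} - n^{0.6} \rceil$ be an auxiliary variable.
We have at the beginning:

$$P\left( N_n^X < \frac{n}{l} - n^{0.6} \right) \le P(N_n^X < m_1)$$

(by using $N_t^X \ge k \iff S_k^X \le t$)

$$= P(S_{m_1}^X > n) = P\left( \frac{S_{m_1}^X}{m_1} - l > \frac{n}{m_1} - l \right)$$

(by 4.2 with $P(|B_{Xi} - l| \le 1) = 1$)

$$\le 2\exp\left( -\frac{m_1}{2} \left( \frac{n}{m_1} - l \right)^2 \right) \qquad (4.7)$$

Now we need to bound $m_1$ in order to get the right order for moderate deviations. Let us
start by looking at the following:

$$\left( \frac{n}{m_1} - l \right)^2 \ge l^2 \left( \frac{n}{n - l n^{0.6} + l} - 1 \right)^2, \quad \text{by using } m_1 \le \frac{n}{l} - n^{0.6} + 1$$

$$= \frac{l^4}{n^{0.8}} \left( \frac{1 - \frac{1}{n^{0.6}}}{1 - \frac{l}{n^{0.4}} + \frac{l}{n}} \right)^2 \qquad (4.8)$$

We have:

$$\lim_{n \to \infty} \left( \frac{1 - \frac{1}{n^{0.6}}}{1 - \frac{l}{n^{0.4}} + \frac{l}{n}} \right)^2 = 1 > \frac{1}{4}.$$

Hence for n large enough, the expression on the right hand side of 4.8 is larger than $l^4/(4n^{0.8})$,
so that:

$$\left( \frac{n}{m_1} - l \right)^2 \ge \frac{l^4}{4n^{0.8}} \qquad (4.9)$$

Also, for n large enough we have:

$$m_1 \ge \frac{n}{l} - n^{0.6} = \frac{n}{l} \left( 1 - \frac{l}{n^{0.4}} \right) \ge \frac{n}{2l} \qquad (4.10)$$

Finally we can use 4.9 and 4.10 in 4.7 to get:

$$P\left( N_n^X < \frac{n}{l} - n^{0.6} \right) \le 2\exp\left( -\frac{m_1}{2} \left( \frac{n}{m_1} - l \right)^2 \right) \le 2\exp\left( -\frac{m_1}{2} \cdot \frac{l^4}{4n^{0.8}} \right) \le 2\exp\left( -\frac{n}{4l} \cdot \frac{l^4}{4n^{0.8}} \right) = 2\exp\left( -\frac{l^3}{16} n^{0.2} \right) \qquad (4.11)$$

for n large enough. For the other term, let $m_2 := \lceil \frac{n}{l} + n^{0.6} \rceil$ be an auxiliary variable and
do the same as before. We have at the beginning:

$$P\left( N_n^X > \frac{n}{l} + n^{0.6} \right) \le P(N_n^X \ge m_2)$$

(by using $N_t^X \ge k \iff S_k^X \le t$)

$$= P(S_{m_2}^X \le n) = P\left( \frac{S_{m_2}^X}{m_2} - l \le \frac{n}{m_2} - l \right)$$

(by 4.2 with $P(|B_{Xi} - l| \le 1) = 1$)

$$\le 2\exp\left( -\frac{m_2}{2} \left( \frac{n}{m_2} - l \right)^2 \right) \qquad (4.12)$$

Now we need to bound $m_2$ in order to get the right order for moderate deviations. Let us
start by looking at the following:

$$\left( \frac{n}{m_2} - l \right)^2 \ge l^2 \left( \frac{n}{n + l n^{0.6}} - 1 \right)^2, \quad \text{by using } m_2 \ge \frac{n}{l} + n^{0.6}$$

$$= \frac{l^4}{n^{0.8}} \left( \frac{1}{1 + \frac{l}{n^{0.4}}} \right)^2 \ge \frac{l^4}{4n^{0.8}} \qquad (4.13)$$

where the very last inequality was obtained by assuming n large enough and noticing that:

$$\lim_{n \to \infty} \left( \frac{1}{1 + \frac{l}{n^{0.4}}} \right)^2 = 1 > \frac{1}{4}.$$

Also, for n large enough we have:

$$m_2 = \left\lceil \frac{n}{l} + n^{0.6} \right\rceil \ge \frac{n}{2l} \qquad (4.14)$$

Finally we can use 4.13 and 4.14 in 4.12 to get:

$$P\left( N_n^X > \frac{n}{l} + n^{0.6} \right) \le 2\exp\left( -\frac{m_2}{2} \left( \frac{n}{m_2} - l \right)^2 \right) \le 2\exp\left( -\frac{m_2}{2} \cdot \frac{l^4}{4n^{0.8}} \right) \le 2\exp\left( -\frac{n}{4l} \cdot \frac{l^4}{4n^{0.8}} \right) = 2\exp\left( -\frac{l^3}{16} n^{0.2} \right) \qquad (4.15)$$

Then combining 4.6, 4.11 and 4.15 we obtain:

$$P(\{N_n^X \in I_n\}^c) \le 4\exp\left( -\frac{l^3}{16} n^{0.2} \right)$$

and by symmetry we finally get:

$$P(C^{nc}) \le 8\exp\left( -\frac{l^3}{16} n^{0.2} \right)$$

for every n large enough. Taking $b_1 = l^3/16 > 0$ the proof is finished. ■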

4.2 Left out blocks in an optimal alignment 

Let $F^n(q)$ denote the already defined event that any optimal alignment of X and Y leaves
out at most a proportion q of blocks in X as well as in Y.

Lemma 4.2 For any q satisfying $q > \frac{4}{9(l-1)}$, there exists $\beta > 0$ such that:

$$P(F^{nc}(q)) \le e^{-\beta n}$$

for all n. Note that here q does not depend on n, and $\beta > 0$ does not depend on n either but only on
l and q.

Lemma 4.3 For every 0 < $\delta$ < j there exists a constant $b_3 > 0$ depending on $\delta$ and l but
not on n such that:

for n large enough. 

4.3 Proportion of blocks in X and Y 

Let $X^m$ be the sequence $X^\infty$ taken up to the m-th block. Similarly, let $Y^m$ be the sequence
$Y^\infty$ taken up to the m-th block. Let $D^m(\delta)$ be the event that the proportions of blocks in
$X^m$ and $Y^m$ of length l - 1, l and l + 1 are not further from 1/3 than $\delta$. Let $D^n(\delta)$ be the
event:

$$D^n(\delta) = \bigcap_{m \in I_n} D^m(\delta)$$

where we defined the interval $I_n = [n/l - n^{0.6}, n/l + n^{0.6}]$.



(1+3S) 
I \ 21 



Lemma 4.4 For every $\delta > 0$ we have that:
for n large enough. 

4.4 Number of blocks for an optimal alignment 

Recall that $N_n^X$ (resp. $N_n^Y$) is the number of blocks in X (resp. in Y) having lengths in
$\{l - 1, l, l + 1\}$ as in expression 4.3. Let $G^n(\delta)$ be the event that the following inequality
holds:

n 

Lemma 4.5 For every $\delta > 0$ there exist two constants $b_4, b_5 > 0$ depending on l and on $\delta$
such that:

for every n large enough. 

4.5 Cut blocks at the end 

Let $J^n(\delta)$ denote the event that the proportion of left out blocks at the end of X or Y in any
optimal alignment is at most a proportion $\delta$ of the total number of blocks in each of these
sequences. As for all the events before, we want to prove that $J^n(\delta)$ has high probability
for every $\delta > 0$ provided n is large enough. We need an extra definition and a preliminary lemma
in order to show the high probability of $J^n(\delta)$.

For an integer $s \in [1, n]$ we denote:

$$L_n^s := |LCS(X_1 X_2 \cdots X_s, Y_1 Y_2 \cdots Y_n)| \qquad (4.16)$$

Lemma 4.6 Given $\delta > 0$, there exists a constant $c^* > 0$ not depending on n but on $\delta$ such
that:

$$E[L_n - L_n^{n - \delta n}] \ge c^* \cdot n \qquad (4.17)$$

for every n large enough.

Proof. Given n > 0 and $t \in [-1, 1]$ let us define the number $\gamma(t, n) \ge 0$ as follows:

$$\gamma(t, n) := \frac{E[\,|LCS(X_1 \cdots X_{n + nt}, Y_1 \cdots Y_{n - nt})|\,]}{n}$$

This number $\gamma(t, n)$ is a kind of extension of the Chvatal-Sankoff constant $\gamma$ (see [1]), or
more precisely, in the case of our paper, an extension of $\gamma_l$ defined as in expression 3.11. An
extended motivation for this definition can be found in [30]. For any fixed $t \in [-1, 1]$ it is
known that $\gamma(t, n)$ converges as $n \to \infty$ (see [15] or [30]); let us denote that limit by

$$\gamma(t) := \lim_{n \to \infty} \gamma(t, n).$$

The speed of convergence to that limit is also known, due to theorem 2.1 in [15]. This theorem
says that there exists a constant $\theta_1 > 0$ not depending on n such that:

$$|\gamma(t, n) - \gamma(t)| \le \theta_1 \frac{\ln(n)}{\sqrt{n}} \qquad (4.18)$$

for any fixed $t \in [-1, 1]$ provided n is large enough. On the other hand, it is known that
the map $t \in [-1, 1] \mapsto \gamma(t) \in [0, 1]$ is concave and symmetric about the origin (see [30]). Hence,
for every $t \in [-1, 1]$ we have

$$\gamma(0) \ge \gamma(t) \qquad (4.19)$$

Let us set the auxiliary variables $n^*$ and $t^*$ as follows:

$$n^* := n\left( 1 - \frac{\delta}{2} \right), \qquad t^* := -\frac{\delta}{2 - \delta},$$

so that $n^* + n^* t^* = n - \delta n$ and $n^* - n^* t^* = n$. Note that with the last definition, the inequality

$$\frac{n \ln(n)}{\sqrt{n}} = \sqrt{n} \ln(n) \ge \sqrt{n^*} \ln(n^*) = \frac{n^* \ln(n^*)}{\sqrt{n^*}} \qquad (4.20)$$

holds due to $\sqrt{\cdot}$ and $\ln(\cdot)$ being increasing functions. By using the previous definitions, the
inequality for the speed of convergence 4.18, the concavity inequality 4.19 and inequality 4.20
(in this order), we can write:

$$E[L_n - L_n^{n - \delta n}] = n\gamma(0, n) - n^* \gamma(t^*, n^*)$$

$$\ge n\left( \gamma(0) - \frac{\theta_1 \ln(n)}{\sqrt{n}} \right) - n^* \left( \gamma(t^*) + \frac{\theta_1 \ln(n^*)}{\sqrt{n^*}} \right)$$

$$\ge n\left( \gamma(0) - \frac{\theta_1 \ln(n)}{\sqrt{n}} \right) - n^* \left( \gamma(0) + \frac{\theta_1 \ln(n^*)}{\sqrt{n^*}} \right)$$

$$= (n - n^*)\gamma(0) - \theta_1 \left( \frac{n \ln(n)}{\sqrt{n}} + \frac{n^* \ln(n^*)}{\sqrt{n^*}} \right)$$

$$\ge (n - n^*)\gamma(0) - 2\theta_1 \frac{n \ln(n)}{\sqrt{n}}$$

$$= n\left( \frac{\delta \gamma(0)}{2} - 2\theta_1 \frac{\ln(n)}{\sqrt{n}} \right) \ge \frac{n \delta \gamma(0)}{4} \qquad (4.21)$$

where the very last inequality above holds for n large enough, since

$$\lim_{n \to \infty} \left( \frac{\delta \gamma(0)}{2} - 2\theta_1 \frac{\ln(n)}{\sqrt{n}} \right) = \frac{\delta \gamma(0)}{2} > \frac{\delta \gamma(0)}{4}.$$

To finish the proof we take $c^* = \delta \gamma(0)/4$. ■

Now comes the main result of this section which establishes the high probability of the event 

Proposition 4.1 For every 5 > 0, there exists a constant 9 > not depending on n but on 
6 such that: 

for every n > large enough. 

Proof. With the notation as in 14.161 we write: 

P(J""(5)) < 2F{\LCSiXi---Xn^Sn,Yi---Yn)\-Ln>0) 

= 2P{ Ll-^" - Ln>0) 

= 2P(L^-^"-L„-E[L^-'^"-L„,] > E[L„-L^-''" ]) (4.22) 

Let us define 

It is not difficult to see that $M_n(\delta)$ is a martingale with respect to the filtration
$\sigma\{(X_k, Y_k) : k \le n\}$ and that $M_0 = 0$. The following inequality also holds:

$$|M_n(\delta) - M_{n-1}(\delta)| \le 4$$

for $\delta > 0$ with probability 1. So, we can use Theorem 4.1 (Azuma-Hoeffding inequality
for martingales) with aj = 4 and v = E[Ln — L^~^"'] to estimate: 

/ ,,2 

P(L5'-'5"-L„-E[L^-^"-L„] >7;) < 2exp' 



2 • 4n 

2 exp 



n 



(by 4.17 and c* from lemma 4.6) < 2 exp I — - ■ n 



Taking = % > finishes the proof. 
4.6 Optimal events



In Theorem 2.3, $q$ represents the proportion of left out blocks in $X$ and in $Y$. In reality,
the proportion of left out blocks in $X$ will typically not be exactly equal to the proportion
of left out blocks in $Y$. Because of this, $q_1$ will designate the proportion of left out blocks
in $X$ and $q_2$ will designate the proportion of left out blocks in $Y$. We will see that $q_1$ can
be made as close to $q_2$ as we want by taking $n$ large. We now need to rewrite all the
conditions of Theorem 2.3 with $q_1$ and $q_2$ instead of $q$.



Let us define the following events:

• Given any $m_1,m_2,q_1,q_2$, let $E^n_{m_1,m_2,q_1,q_2}(\epsilon)$ denote the event that there is no optimal
alignment of $X^{m_1}$ with $Y^{m_2}$ leaving out a proportion $q_1$ of blocks in $X^{m_1}$ and a proportion
$q_2$ of blocks in $Y^{m_2}$ and such that:

$$H(q_1) + H(q_2) + (1 - \max\{q_1+3q_2,\,3q_1+q_2\})\,(\ln(1/9)+H(p)) \le -\epsilon. \qquad (4.23)$$

• Let $E^n(\epsilon)$ be the event:

$$E^n(\epsilon) = \bigcap_{m_1,m_2\in I_n;\ q_1,q_2\in[0,1]} E^n_{m_1,m_2,q_1,q_2}(\epsilon). \qquad (4.24)$$

If $\delta$ designates the difference between $q_1$ and $q_2$, then note that the system

$$\min\left[\frac{p_{l-1,l}+p_{l-1,l+1}}{p_{l-1,l-1}+p_{l-1,l}+p_{l-1,l+1}}\left(1-\frac{3q_1}{(1/3)-\delta}\right) - \left(1-\frac{\delta+2(q_1-\delta)}{(1/3)+\delta}\right)\frac{p_{l+1,l+1}}{p_{l+1,l-1}+p_{l+1,l}+p_{l+1,l+1}} - q_2\,\frac{1+\delta}{(1/3)-\delta}\right]$$

$$|q_1-q_2| \le \delta$$

$$H(q_1)+H(q_2)+(1-\max\{q_1+3q_2,\,3q_1+q_2\})\,(\ln(1/9)+H(p)) \ge 0 \qquad (4.25)$$

converges to the conditions in Theorem 2.3 when $\delta$ goes to zero ($q_1$ can be made as close to $q_2$
as we want by taking $n$ large). Note also that replacing $q_1$ and $q_2$ by $q$ and taking
$\delta = 0$ in the minimized function and in the last inequality of 4.25, they become equal to 2.18
and 2.21, respectively. If the minimizing problem in Theorem 2.3 has a strictly positive solution
$2\epsilon_1$ and if expression 2.18 is less than $\epsilon_1$, this implies that 2.21 must be smaller than some $-\epsilon_2$
for $\epsilon_2 > 0$ (we are assuming that 2.19, 2.20 and 2.21 hold). The next lemma shows that the
same holds true for the system 4.25 if we take $\delta$ small enough.

Lemma 4.7 Assume there exist $0 < q_0 < (1/3)$ and $\epsilon_1 > 0$ such that for all $\{p_{ij}\}_{i,j}$ and
$q \in [0,q_0]$ satisfying all the conditions 2.19, 2.20 and 2.21 in Theorem 2.3 we have that
expression 2.18 is larger or equal to $2\epsilon_1$ (in other words, the condition that the minimizing
problem in Theorem 2.3 has a strictly positive solution $2\epsilon_1$ is satisfied). Then there exist
$\epsilon_2 > 0$ and $\delta_0 > 0$ such that for all $\{p_{ij}\}_{i,j}$ and $q_1,q_2 \in [0,q_0]$ and
$\delta \in [0,\delta_0]$ satisfying 2.19 and 2.20, we have that if $|q_1 - q_2| \le 2\delta_0$ and if

$$\frac{p_{l-1,l}+p_{l-1,l+1}}{p_{l-1,l-1}+p_{l-1,l}+p_{l-1,l+1}}\left(1-\frac{3q_1}{(1/3)-\delta_0}\right) - \left(1-\frac{\delta_0+2(q_1-\delta_0)}{(1/3)+\delta_0}\right)\frac{p_{l+1,l+1}}{p_{l+1,l-1}+p_{l+1,l}+p_{l+1,l+1}} - q_2\,\frac{1+\delta_0}{(1/3)-\delta_0} \le \epsilon_1 \qquad (4.26)$$

then

$$H(q_1)+H(q_2)+(1-\max\{q_1+3q_2,\,3q_1+q_2\})\,(\ln(1/9)+H(p)) \le -\epsilon_2. \qquad (4.27)$$
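To get a feeling for the two quantities compared in the lemma, the left sides of 4.26 and 4.27 can be evaluated numerically for a concrete parameter choice. The following is only a minimal sketch and not part of the original argument: the function names, and the representation of $\{p_{ij}\}$ as a 3x3 nested list whose rows and columns stand for the lengths $l-1$, $l$, $l+1$, are assumptions made for illustration, and the formula for 4.26 follows our reading of that inequality.

```python
import math

def binary_entropy(q):
    """H(q) = q ln(1/q) + (1-q) ln(1/(1-q)), with the convention 0 ln(1/0) = 0."""
    return sum(t * math.log(1.0 / t) for t in (q, 1.0 - q) if t > 0.0)

def entropy(p):
    """H(p) = sum_i p_i ln(1/p_i) over all entries of the 3x3 table p."""
    return sum(v * math.log(1.0 / v) for row in p for v in row if v > 0.0)

def lhs_426(p, q1, q2, d0):
    """Left side of 4.26 (as read here); index 0, 1, 2 means length l-1, l, l+1."""
    gain = (p[0][1] + p[0][2]) / sum(p[0])
    loss = p[2][2] / sum(p[2])
    third = 1.0 / 3.0
    return (gain * (1.0 - 3.0 * q1 / (third - d0))
            - (1.0 - (d0 + 2.0 * (q1 - d0)) / (third + d0)) * loss
            - q2 * (1.0 + d0) / (third - d0))

def lhs_427(p, q1, q2):
    """Left side of 4.27."""
    kept = 1.0 - max(q1 + 3.0 * q2, 3.0 * q1 + q2)
    return (binary_entropy(q1) + binary_entropy(q2)
            + kept * (math.log(1.0 / 9.0) + entropy(p)))

# Hypothetical input: all nine block-pair types equally frequent, nothing left out.
uniform = [[1.0 / 9.0] * 3 for _ in range(3)]
print(lhs_426(uniform, 0.0, 0.0, 0.0))  # close to 1/3
print(lhs_427(uniform, 0.0, 0.0))       # close to 0
```

For the uniform table with no left out blocks the entropy expression vanishes, so it sits exactly on the boundary that the lemma moves away from.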
Proof. We argue by contradiction (reductio ad absurdum). Assume that for all $(p_{ij})_{i,j}$ and
$q \in [0,q_0]$ satisfying all the conditions 2.19, 2.20 and 2.21 in Theorem 2.3, expression 2.18 is
larger or equal to $2\epsilon_1$. Assume that the rest of the lemma did not hold. Then for every
$\delta > 0$ (as small as we want) we could find a vector $p$:

$$p := (p_{l-1,l-1}, p_{l-1,l}, \ldots, p_{l+1,l+1}, q_1, q_2, \delta)$$

whose components satisfy $|q_1 - q_2| \le \delta$ as well as 2.19 and 2.20, whilst inequality 4.26
is satisfied and the expression

$$H(q_1)+H(q_2)+(1-\max\{q_1+3q_2,\,3q_1+q_2\})\,(\ln(1/9)+H(p)) \qquad (4.28)$$

can be taken as close to zero as we want. Hence there exists a sequence $p(1), p(2), \ldots, p(t), \ldots$
of such vectors, with notation:

$$p(t) := (p_{l-1,l-1}(t), p_{l-1,l}(t), \ldots, p_{l+1,l+1}(t), q_1(t), q_2(t), \delta(t))$$

so that for each $t \in \mathbb{N}$ the vector $p(t)$ satisfies all the conditions 2.19, 2.20 and 4.26, whilst

$$\lim_{t\to\infty} |q_1(t) - q_2(t)| = 0$$

and expression 4.28 converges to zero as $t$ goes to infinity.

The vectors $p(t)$ are contained in a bounded domain and hence in a compact domain. This
implies that there exists a converging subsequence. Hence there exists an increasing map
$\pi : \mathbb{N} \to \mathbb{N}$ so that $p(\pi(t))$ converges as $t$ goes to infinity. Let the limit be denoted by

$$p := (p_{l-1,l-1}, p_{l-1,l}, \ldots, p_{l+1,l+1}, q_1, q_2, 0).$$

We have that $q_1 = q_2$, so let us denote this common value by $q$. We find that our limit satisfies
the conditions 2.19 and 2.20. Furthermore, at the limit expression 4.28 becomes equal to zero.
Replacing then $q_1$ and $q_2$ in 4.28 by $q$, we find that condition 2.21 is satisfied. Finally, since
4.26 is satisfied along our sequence $p(\pi(t))$, by continuity it must also be satisfied for
the limit. Hence, noting that at the limit $q_1 = q_2 = q$ and $\delta = 0$, we get that expression 2.18
is less or equal to $\epsilon_1$. This contradicts our assumption, since our limit vector satisfies all the
conditions 2.19, 2.20 and 2.21 and should thus have expression 2.18 larger or equal to $2\epsilon_1$. Hence,
when the conditions 2.19, 2.20 and 4.26 are satisfied and $\delta$ goes to zero,
expression 4.28 must stay bounded away from zero. This means that for $\delta > 0$ small enough,
and when these conditions are satisfied, there exists $\epsilon_2 > 0$ so
that 4.28 is less or equal to $-\epsilon_2$.



Let us now show that the event $E^n_{m_1,m_2,q_1,q_2}(\epsilon)$ holds with high probability.

Lemma 4.8 Assume that there exist $0 < q_0 < (1/3)$ and $\epsilon_1 > 0$ such that for all $\{p_{ij}\}$
and $q \in [0,q_0]$ satisfying all the conditions 2.19, 2.20 and 2.21 in Theorem 2.3, we have that
expression 2.18 is larger or equal to $2\epsilon_1$. Then, for every $\epsilon > 0$ there exist a polynomial
w{n) > and a constant > 0, both only depending on I such that: 



P(^"'(e)) < w{n)e 



for every n large enough. 



Proof. Let $a$ denote an alignment of $X^{m_1}$ and $Y^{m_2}$. Hence $a$ consists of two binary
vectors $a = (a_x, a_y)$, the first one having length $m_1$ and the second one having length $m_2$,
that is, $a_x \in \{0,1\}^{m_1}$ and $a_y \in \{0,1\}^{m_2}$. When the $i$-th entry $a_{xi}$ of $a_x$ is a 1, this means that
the $i$-th block of $X^{m_1}$ is discarded (entirely aligned with gaps) by the alignment $a$; otherwise
the $i$-th block of $X^{m_1}$ is not discarded. Similarly, when $a_{yi} = 1$ the $i$-th block of $Y^{m_2}$
is discarded. Here we define alignments in the same way as explained in the
first section: we specify which blocks get entirely discarded and then align the rest block by
block. Doing so and assuming that the alignment $a$ is not random, we get that the aligned
block pairs are i.i.d. For the lengths of aligned block pairs we have nine possibilities, each
having the same probability. Hence, given the alignment $a$, the empirical frequencies of the
aligned block pair lengths follow a multinomial distribution. Let $p = (p_{l-1,l-1}, p_{l-1,l}, \ldots, p_{l+1,l+1})$
be a (non-random) probability distribution. Let $E_a(p)$ denote the event that the empirical
distribution of the block pairs aligned by the alignment $a$ is not $p$.

From what we said, the probability $P(E^c_a(p))$ is equal to the probability that a
9-nomial variable with parameter $m^*$ and all probability parameters equal to 1/9 gives the
frequencies given by $p$. Here $m^*$ designates the number of block pairs aligned by $a$. Hence,
we get

$$P(E^c_a(p)) = \binom{m^*}{m^*p_{l-1,l-1}\;\; m^*p_{l-1,l}\;\cdots\; m^*p_{l+1,l+1}}\left(\frac{1}{9}\right)^{m^*} \qquad (4.29)$$

where

$$\binom{a}{a_1 \cdots a_k} := \frac{a!}{\prod_{i=1}^{k} a_i!}$$

is the multinomial coefficient. Let us define

$$B(p) := \binom{m^*}{p_{l-1,l-1}m^*\;\cdots\; p_{l+1,l+1}m^*}$$

$$M(p) := \prod_{p_i \in \{p_{l-1,l-1},\ldots,p_{l+1,l+1}\}} p_i^{\,p_i}$$

$$H(p) := \sum_{p_i \in \{p_{l-1,l-1},\ldots,p_{l+1,l+1}\}} p_i \ln(1/p_i) = \ln\left(\frac{1}{M(p)}\right) \qquad (4.30)$$

Note that $B(p)\cdot(M(p))^{m^*}$ is the probability that a multinomial random variable
with parameters $m^*$ and probability vector $p$ takes the value $(m^*p_{l-1,l-1}, \ldots, m^*p_{l+1,l+1})$. Hence

$$B(p)\cdot(M(p))^{m^*} \le 1. \qquad (4.31)$$

Then, by using 4.31 we can bound expression 4.29 as follows:

$$P(E^c_a(p)) \le \left(\frac{1/9}{M(p)}\right)^{m^*} = \exp\big((\ln(1/9)+H(p))\,m^*\big) \qquad (4.32)$$
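The bound 4.32 can be checked numerically against the exact multinomial probability. The sketch below is only an illustration under stated assumptions: the function names and the count vector are hypothetical, and logarithms are used throughout to avoid underflow.

```python
import math

def log_prob_uniform(counts):
    """ln of P(counts) for a multinomial with len(counts) equally likely cells."""
    m = sum(counts)
    log_coeff = math.lgamma(m + 1) - sum(math.lgamma(c + 1) for c in counts)
    return log_coeff + m * math.log(1.0 / len(counts))

def log_bound(counts):
    """ln of the right side of 4.32: (ln(1/k) + H(p)) * m with p = counts / m."""
    m = sum(counts)
    h = sum((c / m) * math.log(m / c) for c in counts if c > 0)
    return (math.log(1.0 / len(counts)) + h) * m

# Hypothetical counts of the nine aligned block-pair types, so m* = 18.
counts = [4, 1, 0, 2, 5, 3, 0, 1, 2]
print(log_prob_uniform(counts) <= log_bound(counts))  # True
```

When all aligned pairs fall into a single cell, $H(p) = 0$ and the bound holds with equality.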



On the other hand, we have at least $(1-\max\{q_1+3q_2,\,3q_1+q_2\})\min\{m_1,m_2\}$ aligned block
pairs. Let us give an intuition for this. There are three situations when aligning a fixed block in
$X$ with blocks in $Y$. First, when we align one block in $X$ with one block in $Y$ one to one, the
resulting length contributing to the LCS is the minimum of their lengths, so if
all the blocks of $X$ and $Y$ are aligned one to one we have a contribution
of at most $\min\{m_1,m_2\}$ aligned block pairs. Second, when we align one block in $X$ with several
blocks in $Y$, we leave out at least $q_1 \cdot m_1$ blocks in $X$. Third, since we know that we cannot
align two adjacent blocks in $X$ with the same block in $Y$, we leave out at least $2q_1 \cdot m_1$
blocks in $X$ as well. In total, in the worst case, looking first at blocks in $X$, we are leaving out
$(3q_1+q_2)\min\{m_1,m_2\}$ blocks in both sequences $X$ and $Y$. Similarly, looking first at $Y$,
we can leave out $(3q_2+q_1)\min\{m_1,m_2\}$ blocks in both sequences $X$ and $Y$. Finally, we
have at least $(1-\max\{q_1+3q_2,\,3q_1+q_2\})\min\{m_1,m_2\}$ aligned block pairs due to the considerations
above.

Since $m_1, m_2 \in I_n$, this gives the lower bound for $m^*$

$$m^* \ge (1-\max\{q_1+3q_2,\,3q_1+q_2\})\cdot((n/l) - n^{0.6}) \qquad (4.33)$$

and hence together with the bound 4.32 we obtain

$$P(E^c_a(p)) \le \exp\big((\ln(1/9)+H(p))\,(1-\max\{q_1+3q_2,\,3q_1+q_2\})\,((n/l)-n^{0.6})\big) \qquad (4.34)$$

Let $\mathcal{A}_{m_1,m_2,q_1,q_2}$ denote the set of all alignments aligning $X^{m_1}$ with $Y^{m_2}$ and leaving out
a proportion $q_1$ of blocks in $X^{m_1}$ and a proportion $q_2$ of blocks in $Y^{m_2}$. In other words,
the set $\mathcal{A}_{m_1,m_2,q_1,q_2}$ is the set of all elements $a = (a_x, a_y)$ of $\{0,1\}^{m_1} \times \{0,1\}^{m_2}$ for which
$|a_x| = q_1 m_1$ and $|a_y| = q_2 m_2$.

Let $\mathcal{P}_{\epsilon,q_1,q_2}$ denote the set of those distributions $p$ (for aligned block pairs, hence on the space
$\{(l-1,l-1), (l-1,l), \ldots, (l+1,l+1)\}$) for which inequality 4.23 is satisfied and which
are possible in our case. Before we continue with the proof, let us look at an example:

Example 4.1 Assume we look at binary strings of length 5. Then there can be 0, 1, 2, 3, 4 or 5
ones. Hence, the empirical distribution for side one when we flip a coin exactly five times can
only be 0, 20%, 40%, 60%, 80% or 100%. In general, for a string of length $n$ and $k$ symbols,
there are no more than $(n+1)^{k-1}$ possible empirical distributions (see [?], Lemma 2.1.2 (a)).
In the case above we have an empirical distribution for $m^*$ aligned block pairs. For each block
pair there are 9 possibilities. Hence, there are no more than $(m^*+1)^8$ possible empirical
distributions. However, $m^*$ is not known. It could potentially take on any value between 1
and $(n/l) + n^{0.6}$. Hence, we find that for the number of empirical distributions we need to
consider the following upper bound:

$$((n/l) + n^{0.6}) \cdot ((n/l) + n^{0.6} + 1)^8 \le ((n/l) + n^{0.6} + 1)^9.$$
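The counting in the example can be made concrete. An empirical distribution of $n$ draws over $k$ symbols is determined by a vector of $k$ nonnegative counts summing to $n$, and the number of such vectors is the standard stars-and-bars binomial coefficient. The sketch below (the function name is an assumption for illustration) compares this exact count with the polynomial bound $(n+1)^{k-1}$, which holds because the first $k-1$ counts each lie in $\{0,\ldots,n\}$ and determine the last one.

```python
import math

def count_empirical_distributions(n, k):
    """Number of vectors of k nonnegative integer counts that sum to n."""
    return math.comb(n + k - 1, k - 1)

print(count_empirical_distributions(5, 2))                     # 6
print(count_empirical_distributions(10, 9) <= (10 + 1) ** 8)   # True
```

For five coin flips this recovers the six possible frequencies listed in the example.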



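Let us continue with the proof. We have that:

$$\bigcup_{a \in \mathcal{A}_{m_1,m_2,q_1,q_2},\; p \in \mathcal{P}_{\epsilon,q_1,q_2}} E^c_a(p) \supseteq E^{nc}_{m_1,m_2,q_1,q_2}(\epsilon)$$

and hence:

$$P(E^{nc}_{m_1,m_2,q_1,q_2}(\epsilon)) \le \sum_{a \in \mathcal{A}_{m_1,m_2,q_1,q_2},\; p \in \mathcal{P}_{\epsilon,q_1,q_2}} P(E^c_a(p)). \qquad (4.35)$$

By using 4.34, the inequality 4.35 above becomes:

$$P(E^{nc}_{m_1,m_2,q_1,q_2}(\epsilon)) \le \sum_{a \in \mathcal{A}_{m_1,m_2,q_1,q_2},\; p \in \mathcal{P}_{\epsilon,q_1,q_2}} \exp\big((\ln(1/9)+H(p))\,(1-\max\{q_1+3q_2,\,3q_1+q_2\})\,((n/l)-n^{0.6})\big).$$

Note that the number of alignments considered in the sum on the right hand side of the last
inequality above can be bounded as follows:

$$|\mathcal{A}_{m_1,m_2,q_1,q_2}| = \binom{m_1}{q_1 m_1\;\;(1-q_1)m_1}\binom{m_2}{q_2 m_2\;\;(1-q_2)m_2} \le \exp(H(q_1)m_1 + H(q_2)m_2)$$

$$\le \exp\big((H(q_1)+H(q_2))\,m^*\big)$$

$$\le \exp\big((H(q_1)+H(q_2))\,((n/l)+n^{0.6})\big) \qquad (4.36)$$

where for $i = 1, 2$ we denote

$$H(q_i) := q_i \ln(1/q_i) + (1-q_i)\ln(1/(1-q_i)).$$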
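The entropy bound on the number of ways to choose the left out blocks is a standard estimate for binomial coefficients, and it is easy to check numerically. The following sketch is only an illustration; the function names and the values of `m` and `left_out` are hypothetical.

```python
import math

def binary_entropy(q):
    """H(q) = q ln(1/q) + (1-q) ln(1/(1-q)), with the convention 0 ln(1/0) = 0."""
    return sum(t * math.log(1.0 / t) for t in (q, 1.0 - q) if t > 0.0)

def log_num_choices(m, left_out):
    """ln of the number of ways to pick which `left_out` of m blocks are discarded."""
    return math.log(math.comb(m, left_out))

# Hypothetical values: m = 200 blocks, a proportion q = 0.15 of them left out.
m, left_out = 200, 30
print(log_num_choices(m, left_out) <= binary_entropy(left_out / m) * m)  # True
```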

The number of distributions in $\mathcal{P}_{\epsilon,q_1,q_2}$ we need to consider is (as explained above) less or
equal to $((n/l) + n^{0.6} + 1)^9$. Combining all of this we find that $P(E^{nc}_{m_1,m_2,q_1,q_2}(\epsilon))$ is less or
equal to:

$$\exp\big((H(q_1)+H(q_2))\,((n/l)+n^{0.6})\big)\cdot b\cdot \exp\big((\ln(1/9)+H(p))\,(1-\max\{q_1+3q_2,\,3q_1+q_2\})\,((n/l)-n^{0.6})\big)$$

where $b := ((n/l) + n^{0.6} + 1)^9$. In other words, we found that:

$$P(E^{nc}_{m_1,m_2,q_1,q_2}(\epsilon)) \le b\,\exp\left(\frac{n}{l}\Big(H(q_1)+H(q_2)+(\ln(1/9)+H(p))\,(1-\max\{q_1+3q_2,\,3q_1+q_2\}) + r\Big)\right) \qquad (4.37)$$

where the rest term $r$ is equal to:

$$r = l\,n^{-0.4}\Big(H(q_1)+H(q_2) - (\ln(1/9)+H(p))\,(1-\max\{q_1+3q_2,\,3q_1+q_2\})\Big)$$

being bounded as follows:

$$|r| \le l\,n^{-0.4}\big(|H(q_1)| + |H(q_2)| + (|\ln(1/9)| + |H(p)|)\big) = l\,n^{-0.4}\,(3 + |\ln(1/9)|).$$

Note that $r$ is bounded from above by a constant times $n^{-0.4}$ where the constant does not
depend on $l, q_1, q_2, p$. Hence for $n$ large enough:

$$r \le \epsilon/2 \qquad (4.38)$$

Note also that in the sum 4.35, we only took distributions $p \in \mathcal{P}_{\epsilon,q_1,q_2}$, hence satisfying
inequality 4.23. This implies that in the bound 4.37, we can assume that inequality 4.23
holds. This then implies

$$P(E^{nc}_{m_1,m_2,q_1,q_2}(\epsilon)) \le b\,\exp\left(\frac{n}{l}\,(-\epsilon + r)\right) \qquad (4.39)$$

Assuming now that 4.38 holds, we obtain:

$$P(E^{nc}_{m_1,m_2,q_1,q_2}(\epsilon)) \le b\,\exp\left(\frac{n}{l}\,(-\epsilon/2)\right) \qquad (4.40)$$



Note that the bound on the right side of the last inequality above is exponentially
small in $n$, since $b$ is an expression which is only polynomial in $n$. Using equation 4.24
we obtain:

$$P(E^{nc}(\epsilon)) \le \sum_{m_1,m_2 \in I_n;\ q_1,q_2} P(E^{nc}_{m_1,m_2,q_1,q_2}(\epsilon)).$$

Applying inequality 4.40 to the last inequality above, we obtain:

$$P(E^{nc}(\epsilon)) \le \sum_{m_1,m_2 \in I_n;\ q_1,q_2} b\,\exp\left(\frac{n}{l}\,(-\epsilon/2)\right) \qquad (4.41)$$

Note that when $m_1$ is given, the number of possibilities for the number of left out blocks in
$X^{m_1}$ is at most $m_1$. Hence, for given $m_1$, $q_1$ can take on at most $m_1$ values.
Similarly, for given $m_2$, $q_2$ can take on at most $m_2$ values. But $m_1$ and $m_2$
are less than $(n/l) + n^{0.6}$. Also, both $m_1$ and $m_2$ are in $I_n$, hence each can take on at most
$2n^{0.6}$ values. This implies that in the sum 4.41, the number of terms is bounded above by the
expression:

$$((n/l)+n^{0.6})^2\, 4n^{1.2}.$$

This upper bound applied to inequality 4.41 yields:

$$P(E^{nc}(\epsilon)) \le b\,((n/l)+n^{0.6})^2\, 4n^{1.2} \exp\left(\frac{n}{l}\,(-\epsilon/2)\right) \qquad (4.42)$$

which is the negative exponential upper bound we were looking for.
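To see how quickly the exponential factor in 4.42 overtakes the polynomial prefactor, one can evaluate the logarithm of the right side for growing $n$. This is a rough sketch under stated assumptions: the function name is hypothetical, the values of $l$ and $\epsilon$ are arbitrary, and the power appearing in $b$ is passed as a parameter (`b_exponent`) because its exact value only affects the polynomial part.

```python
import math

def log_bound_442(n, l, eps, b_exponent=9):
    """ln of the right side of 4.42; b_exponent is the (assumed) power in b."""
    top = n / l + n ** 0.6
    log_b = b_exponent * math.log(top + 1.0)
    log_terms = 2.0 * math.log(top) + math.log(4.0 * n ** 1.2)
    return log_b + log_terms - (n / l) * (eps / 2.0)

for n in (10 ** 3, 10 ** 5, 10 ** 7):
    print(n, log_bound_442(n, l=5, eps=0.1))
```

For small $n$ the logarithm can still be positive, so the bound is only informative once $n$ is large, in line with the phrase "for every $n$ large enough" in the lemma.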
4.7 Positive expected change in the score

Let us recall the events that we have proven to have high probability:

• $C^n$ is the event that the number of blocks in $X$ and in $Y$ lies in the interval $I_n$.

• $D^n(\delta)$ is an intersection of events $D_m(\delta)$, where $D_m(\delta)$ is the event that the
proportions of blocks in $X^m$ and in $Y^m$ of length $l-1$, $l$ and $l+1$ are not further from 1/3
than $\delta$, where $X^m$ (resp. $Y^m$) denotes the sequence $X^\infty$ taken up to the $m$-th block
(resp. the sequence $Y^\infty$ taken up to the $m$-th block).

• $F^n(q)$ is the event that any optimal alignment of $X$ and $Y$ leaves out at most a proportion
$q$ of blocks in $X$ as well as in $Y$.

• is the event that the following inequality holds: 

where (resp. N^) is the number of blocks in X (resp. in Y) having length in 
{1-1,1,1 + 1}. 

• $E^n(\epsilon)$ is the intersection

$$E^n(\epsilon) = \bigcap_{m_1,m_2\in I_n;\ q_1,q_2\in[0,1]} E^n_{m_1,m_2,q_1,q_2}(\epsilon)$$

where $E^n_{m_1,m_2,q_1,q_2}(\epsilon)$ is the event that there is no optimal alignment of $X^{m_1}$ with $Y^{m_2}$
leaving out a proportion $q_1$ of blocks in $X^{m_1}$ and a proportion $q_2$ of blocks in $Y^{m_2}$ and
such that:

$$H(q_1)+H(q_2)+(1-\max\{q_1+3q_2,\,3q_1+q_2\})\,(\ln(1/9)+H(p)) \le -\epsilon_2$$

where $\epsilon_2 > 0$ depends on $\epsilon$, $\delta_0$ and $q_0$ and comes from Lemma 4.7, $X^{m_1}$ (resp. $Y^{m_2}$)
denotes the sequence $X^\infty$ taken up to the $m_1$-th block (resp. the sequence $Y^\infty$ taken
up to the $m_2$-th block) and $H(p)$ denotes the entropy as in 4.30 for an alignment.

We can now formulate our combinatorial lemma based on those events:

Lemma 4.9 Let us consider the constants $q_0, \epsilon_1, \delta_0$ and $\epsilon_2$ from Lemma 4.7. Assume that
$C^n$, $D^n(\delta_0)$, $F^n(q_0)$, $G^n(\delta_0)$ and $E^n(\epsilon_2)$ all hold. Then, we have that

$$E[\tilde{L}_n - L_n \mid X, Y] \ge \epsilon_1.$$

Proof. For any $x, y \in \{0,1\}^n$ let $L(x,y)$ denote the length of the LCS of $x$ and $y$. Let
now $x, y \in \{0,1\}^n$ be any two realizations so that if $X = x$ and $Y = y$, then the events $C^n$,
$D^n(\delta_0)$, $F^n(q_0)$ and $E^n(\epsilon_2)$ all hold. Let $a$ be a left most optimal alignment of $x$ and $y$. Let
$\tilde{x}$ denote the sequence $x$ on which we performed our random changes. That is, $\tilde{x}$ is obtained
by selecting a block of length $l-1$ at random and changing it to length $l$, and also selecting
a block of length $l+1$ at random and reducing it to length $l$. Let $x^*$ be the sequence we
obtain by applying to $x$ only the first one of the two random changes. That is, $x^*$ is obtained
by increasing the length of a randomly chosen block of $x$ of length $l-1$ to length $l$. So, we
start with $x$. Then we apply the first change and obtain $x^*$. Then in $x^*$ we choose a block of
length $l+1$ at random and decrease it by one unit to obtain $\tilde{x}$.
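The two-step random modification described above can be sketched in code. This is only an illustration under stated assumptions: a binary sequence is represented by its list of block lengths (consecutive blocks alternate between the two symbols), every name below is hypothetical, both a block of length $l-1$ and a block of length $l+1$ are assumed to exist, and the LCS length is computed with the textbook dynamic program rather than with any method from the paper.

```python
import random

def modify_blocks(blocks, l, rng):
    """Return (x_star, x_tilde) as lists of block lengths.

    x_star:  one block of length l-1, chosen uniformly at random, grows to l.
    x_tilde: in x_star, one block of length l+1, chosen uniformly, shrinks to l.
    """
    x_star = list(blocks)
    i = rng.choice([k for k, b in enumerate(x_star) if b == l - 1])
    x_star[i] = l
    x_tilde = list(x_star)
    j = rng.choice([k for k, b in enumerate(x_tilde) if b == l + 1])
    x_tilde[j] = l
    return x_star, x_tilde

def to_string(blocks):
    """Binary string whose consecutive blocks alternate between '0' and '1'."""
    return "".join("01"[k % 2] * b for k, b in enumerate(blocks))

def lcs_length(x, y):
    """Length of a longest common subsequence (standard dynamic program)."""
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0]
        for j, cy in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if cx == cy else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

rng = random.Random(0)
x_blocks, y_blocks = [2, 4, 3, 2, 4], [3, 3, 2, 4, 4]  # hypothetical data, l = 3
x_star, x_tilde = modify_blocks(x_blocks, 3, rng)
y = to_string(y_blocks)
# Change in the LCS score caused by the modification of x.
print(lcs_length(to_string(x_tilde), y) - lcs_length(to_string(x_blocks), y))
```

Since one block grows by one unit and another shrinks by one unit, the total length of the sequence is unchanged by the modification.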

For all $i, j \in \{l-1, l, l+1\}$, let $p_{ij}$ denote the proportion of aligned block pairs with lengths
$(i,j)$ in the alignment $a$ of $x$ and $y$. Let $q_1$, resp. $q_2$, denote the proportion of blocks not
aligned by $a$ in $x$, resp. in $y$. Let $p^{I}_{l-1}$ denote the proportion of blocks which get aligned by $a$
one block to one block, among all blocks of $x$ of length $l-1$. Let $p^{II}_{l-1}$ denote the proportion
among all blocks of $x$ of length $l-1$ which are aligned with several blocks of $y$. Finally, let
$p^{III}_{l-1}$ denote the proportion among the blocks of length $l-1$ in $x$ which are left out or are,
together with other blocks of $x$, aligned with the same block of $y$. Note that when we increase
by one unit a block in this third category, then in general the score does not get any increase.
On the other hand, assume that the block of $x$ of length $l-1$ chosen randomly and increased
by one unit is aligned one block with one block. Then if this chosen block is aligned with a
block of length $l$ or $l+1$, the score is going to increase. Let $G_{l-1,I}$ be the event that the block
of length $l-1$ chosen is aligned one block with one block. From what we said it follows that:

$$P\big(L(x^*,y)-L(x,y) = 1 \mid G_{l-1,I}\big) \ge \frac{p_{l-1,l}+p_{l-1,l+1}}{p_{l-1,l-1}+p_{l-1,l}+p_{l-1,l+1}}.$$

Note that by only adding a bit the score cannot decrease, so that the last inequality above
means:

$$E\big[L(x^*,y)-L(x,y) \mid G_{l-1,I}\big] \ge \frac{p_{l-1,l}+p_{l-1,l+1}}{p_{l-1,l-1}+p_{l-1,l}+p_{l-1,l+1}}. \qquad (4.43)$$

When the block of length $l-1$ chosen and increased is aligned with several blocks of $y$ at the
same time, then we will always observe an increase of one unit. This yields:

$$E\big[L(x^*,y)-L(x,y) \mid G_{l-1,II}\big] = 1, \qquad (4.44)$$

where $G_{l-1,II}$ denotes the event that the chosen block of length $l-1$ is aligned with several
blocks of $y$. By the law of total probability we thus find:

$$E[L(x^*,y)-L(x,y)] \ge P(G_{l-1,I})\,\frac{p_{l-1,l}+p_{l-1,l+1}}{p_{l-1,l-1}+p_{l-1,l}+p_{l-1,l+1}} + P(G_{l-1,II})$$

$$\ge \big(1-P(G_{l-1,III})\big)\,\frac{p_{l-1,l}+p_{l-1,l+1}}{p_{l-1,l-1}+p_{l-1,l}+p_{l-1,l+1}}$$

where $G_{l-1,III}$ denotes the event that the block of length $l-1$ chosen is left out or aligned to
the same block of $y$ at the same time as other blocks of $x$. The last inequality above yields:

$$E[L(x^*,y)-L(x,y)] \ge \big(1-p^{III}_{l-1}\big)\,\frac{p_{l-1,l}+p_{l-1,l+1}}{p_{l-1,l-1}+p_{l-1,l}+p_{l-1,l+1}}. \qquad (4.45)$$

Note that the proportion of left out blocks in $x$ is $q_1$. There cannot be two adjacent blocks
of $x$ aligned with the same block of $y$ (this is so because $a$ is an optimal left most alignment,
see Lemma 3.1). So between blocks of $x$ aligned with the same block of $y$, there is at least one
left out block of $x$. Hence the proportion of blocks of $x$ which are aligned at the
same time as other blocks of $x$ to the same block of $y$ cannot exceed twice the proportion of
left out blocks of $x$. This yields an upper bound equal to $2q_1$. This is as a proportion among
all blocks in $x$, but we are interested in the number as a proportion of the total number of
blocks of length $l-1$ of $x$. So, we get as upper bound $2q_1/p_{l-1}$, where $p_{l-1}$ is the proportion
of blocks of $x$ which have length $l-1$. Adding the blocks in $x$ which are left out to the
blocks which are aligned together with other blocks of $x$ to the same block of $y$, we get:

$$p^{III}_{l-1} \le \frac{3q_1}{p_{l-1}}. \qquad (4.46)$$

By $D^n(\delta_0)$, we have that $p_{l-1} \ge (1/3) - \delta_0$, so that together with 4.46 we obtain:

$$p^{III}_{l-1} \le \frac{3q_1}{(1/3)-\delta_0}.$$

By using the above inequality in 4.45 we obtain:

$$E[L(x^*,y)-L(x,y)] \ge \frac{p_{l-1,l}+p_{l-1,l+1}}{p_{l-1,l-1}+p_{l-1,l}+p_{l-1,l+1}}\left(1-\frac{3q_1}{(1/3)-\delta_0}\right). \qquad (4.47)$$

Next we investigate the effect of decreasing a randomly chosen block of length
$l+1$ by one unit. The score can decrease when the selected block of $x$ of length $l+1$ is
aligned with a block of length $l+1$ of $y$. If it is aligned with one block and that block has
length $l$ or $l-1$, then there is no decrease. This leads to:

$$E\big[L(\tilde{x},y)-L(x^*,y) \mid G_{l+1,I}\big] \ge -\frac{p_{l+1,l+1}}{p_{l+1,l-1}+p_{l+1,l}+p_{l+1,l+1}}$$

where $G_{l+1,I}$ denotes the event that the block of length $l+1$ chosen is aligned one block with
one block. When the selected block of $x$ of length $l+1$ is aligned with several blocks of $y$,
then the score decreases by one unit. When the selected block of length $l+1$ in $x$ is left out
or is aligned at the same time as other blocks of $x$ to the same block of $y$, then there is no
decrease. This leads to:

$$E[L(\tilde{x},y)-L(x^*,y)] \ge -P(G_{l+1,I})\,\frac{p_{l+1,l+1}}{p_{l+1,l-1}+p_{l+1,l}+p_{l+1,l+1}} - P(G_{l+1,II}), \qquad (4.48)$$

where $G_{l+1,II}$ denotes the event that the selected block of length $l+1$ is aligned with several
blocks of $y$ at the same time. Let $p_{l+1}$ denote the total proportion of blocks of length $l+1$
among all blocks of $x$. Let $p^{I}_{l+1}$ denote the proportion, among all blocks of $x$ of length $l+1$,
of blocks which are aligned one to one. There is a proportion $q_1$ of totally left out blocks in
$x$. At most a proportion $\delta_0$ are at the end of the alignment $a$ in a contiguous group of left out
blocks. That means (assuming $q_1 \ge \delta_0$) that the proportion of left out blocks in $x$ which are
not adjacent to another left out block of $x$ is at least $q_1 - \delta_0$. Going with each left out block
which is not adjacent to another left out block, there is at least one adjacent block which is
aligned together with several other blocks of $x$ to the same block of $y$. This gives a lower
bound of $\delta_0 + 2(q_1 - \delta_0)$ for the blocks of $x$ which are not aligned one block to one block. This
is taken as a proportion among all blocks of $x$. This gives among all blocks of length $l+1$ a
proportion of at least:

$$\frac{\delta_0 + 2(q_1-\delta_0)}{(1/3)+\delta_0},$$

since by the event $D^n(\delta_0)$ we know that among all blocks of $x$ the proportion of the blocks
of length $l+1$ is less than $(1/3) + \delta_0$. Hence,

$$P(G_{l+1,I}) \le 1 - \frac{\delta_0 + 2(q_1-\delta_0)}{(1/3)+\delta_0}. \qquad (4.49)$$



Next let us note that we can give an upper bound for the number of blocks of $x$ aligned with
several blocks of $y$. Since we never have several blocks aligned with several blocks, the
number of blocks of $x$ aligned with several blocks of $y$ is not more than the total
number of left out blocks of $y$. This is so because between two blocks aligned with the same
block there is always at least one left out block. The proportion of left out blocks in $y$ is $q_2$,
but this is taken as a proportion among all the blocks of $y$. Since the total numbers of blocks in
$x$ and $y$ might not be exactly the same, that number can change slightly when we report
it as a proportion of the total number of blocks in $x$. Let $p_{l+1}$ denote the proportion among
the blocks of $x$ which are of length $l+1$. We thus have that the probability to select a block
of length $l+1$ of $x$ which is aligned with several blocks of $y$ is less or equal to



P(Gi+i,,,) < 



Pi+i 



By the event D'^{5o) we have 
and by the event G"{6o) we have 



Pi+i ^ 3 - '^0 



Applying now 4.51 and 4.52 to 4.50 we find 



P(Q+l,//) < q2 



:i/3) - '^o 



(4.50) 

(4.51) 
(4.52) 

(4.53) 



Finally, using inequalities 4.53 and 4.49 in 4.48 we get:

$$E[L(\tilde{x},y)-L(x^*,y)] \ge -\left(1-\frac{\delta_0+2(q_1-\delta_0)}{(1/3)+\delta_0}\right)\frac{p_{l+1,l+1}}{p_{l+1,l-1}+p_{l+1,l}+p_{l+1,l+1}} - q_2\,\frac{1+\delta_0}{(1/3)-\delta_0}. \qquad (4.54)$$

Using inequalities 4.47 and 4.54 together we find:

$$E[L(\tilde{x},y)-L(x,y)] = E[L(\tilde{x},y)-L(x^*,y)] + E[L(x^*,y)-L(x,y)]$$

$$\ge \frac{p_{l-1,l}+p_{l-1,l+1}}{p_{l-1,l-1}+p_{l-1,l}+p_{l-1,l+1}}\left(1-\frac{3q_1}{(1/3)-\delta_0}\right) - \left(1-\frac{\delta_0+2(q_1-\delta_0)}{(1/3)+\delta_0}\right)\frac{p_{l+1,l+1}}{p_{l+1,l-1}+p_{l+1,l}+p_{l+1,l+1}} - q_2\,\frac{1+\delta_0}{(1/3)-\delta_0}. \qquad (4.55)$$



Note next that we can apply lemma 3.2 with A 
to E^{5o). Hence we find that: 



n 



0.6 



(4.55) 



because of C" and 61,82 < Sq thanks 



+ 



\qi - q2\ < i-5| 

We assume that n is large enough so that: 

\qi - q2\ < 2|(5o|. 



4/n°-' 



n 



With the last inequality holding, we get from Lemma 4.7 that if inequality 4.26 holds, then
4.27 should be satisfied. By the event $E^n(\epsilon_2)$, we have that 4.27 cannot be satisfied. Hence,
the inequality 4.26 cannot hold, which implies that the expression on the left side of 4.26 is
larger or equal to $\epsilon_1$. Together with inequality 4.55 this implies that:

$$E[L(\tilde{x},y)-L(x,y)] \ge \epsilon_1.$$